Bug 48760 - Tomcat breaks in serving large files (>=1.8MB) under high load and high threading.
Summary: Tomcat breaks in serving large files (>=1.8MB) under high load and high threading.
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 6
Classification: Unclassified
Component: Catalina
Version: 6.0.16
Hardware: Sun Solaris
Importance: P2 major
Target Milestone: default
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-17 20:44 UTC by Stefan Kendall
Modified: 2014-02-17 13:56 UTC



Description Stefan Kendall 2010-02-17 20:44:37 UTC
When serving a 1.8MB file, Tomcat 6.0.16 through 6.0.20 (all versions tested) will occasionally throw a "Bad file number" IOException and serve only part of the file. Under high load this happens roughly a few times a day, but we have stress tools that reliably reproduce the issue, and multiple continuous wget connections can reproduce it as well.

We are serving a large SWF application, although this may affect arbitrary binary data. To reproduce this issue, serve a large file (at least 1.8MB, in our case), and continuously request the file over a period of several hours to a few days; this issue happens intermittently, but it is reproducible.
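
For illustration, here is a minimal sketch of the kind of client loop that reproduces it; the URL and expected size below are placeholders for our setup, and several instances are run in parallel to approximate the multiple concurrent wget connections.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RepeatedDownload {
    public static void main(String[] args) throws Exception {
        // Placeholders: point at any static file of roughly 1.8MB or more.
        URL url = new URL("http://localhost:8080/static/large.swf");
        long expectedLength = 1800000L;

        for (int i = 0; ; i++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            long received = 0;
            InputStream in = null;
            try {
                in = conn.getInputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    received += n;
                }
            } catch (Exception e) {
                System.err.println("request " + i + " failed: " + e);
            } finally {
                if (in != null) {
                    in.close();
                }
                conn.disconnect();
            }
            if (received < expectedLength) {
                System.err.println("request " + i + " truncated at " + received + " bytes");
            }
        }
    }
}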

This problem is especially critical to us, as we use Apache as a cache in front of Tomcat. For whatever reason, Apache aggressively caches the bad data and never requests the file again, although that is probably a separate problem. In this configuration, however, after the file has gone bad, all users are served corrupt files, causing the application to break entirely.

This was reported in 2008 here, but the issue was dismissed:
http://mail-archives.apache.org/mod_mbox/tomcat-dev/200808.mbox/%3Cbug-45601-78@https.issues.apache.org/bugzilla/%3E

The fix provided in that thread does resolve the issue, but our IT department cannot support a custom-rolled version of Tomcat indefinitely. It is not a sustainable solution, and this issue presumably affects other users as well. We are not sure why the fix works, but it does resolve our completely reproducible scenario. Ideally, then, someone with more knowledge of Tomcat could work out why the fix works and what, exactly, it solves, and then integrate it into the main Tomcat release line.

Here is how the error appears in the logs:
Feb 16, 2010 1:25:59 PM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet default threw exception
java.io.IOException: Bad file number
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:199)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
	at java.io.FilterInputStream.read(FilterInputStream.java:90)
	at org.apache.catalina.servlets.DefaultServlet.copyRange(DefaultServlet.java:1968)
	at org.apache.catalina.servlets.DefaultServlet.copy(DefaultServlet.java:1714)
	at org.apache.catalina.servlets.DefaultServlet.serveResource(DefaultServlet.java:809)
	at org.apache.catalina.servlets.DefaultServlet.doGet(DefaultServlet.java:325)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.ha.session.JvmRouteBinderValve.invoke(JvmRouteBinderValve.java:209)
	at org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:347)
	at org.terracotta.modules.tomcat.tomcat_5_5.SessionValve55.invoke(SessionValve55.java:57)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
	at java.lang.Thread.run(Thread.java:619)
Comment 1 Remy Maucherat 2010-02-17 22:07:35 UTC
An exact duplicate already exists, so no need to open a new bug. Mark's comments are also perfectly sensible, don't you agree ?

*** This bug has been marked as a duplicate of bug 45601 ***
Comment 2 Stefan Kendall 2010-02-17 23:04:19 UTC
(In reply to comment #1)
> An exact duplicate already exists, so no need to open a new bug. Mark's
> comments are also perfectly sensible, don't you agree ?
> 
> *** This bug has been marked as a duplicate of bug 45601 ***

That bug was resolved as INVALID, and none of the response comments are helpful or correct in any way. This IS an issue, and it's reproducible. The original poster did not give a reproducible scenario, and I believe I have. As such, I believe that the previous ticket should be reopened with this exact test scenario (high load, large binary file) as proof of the bug, or this ticket should stay open as it stands.

This bug is critically damaging our successful deployment under Tomcat, and it's something that needs to be resolved, rather than closed because those involved don't understand the problem. A reproducible bug is a bug whether it is understandable or not.
Comment 3 Mark Thomas 2010-02-17 23:17:09 UTC
(In reply to comment #2)
> This IS an issue, and it's reproducible.

Being reproducible is not sufficient, on its own, to make it a Tomcat bug. There are plenty of other possible causes. Please see https://issues.apache.org/bugzilla/show_bug.cgi?id=45601#c3 and follow the advice there.
Comment 4 Stefan Kendall 2010-02-17 23:33:50 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > This IS an issue, and it's reproducible.
> 
> Being reproducible is not sufficient, on its own, to make it a Tomcat bug.
> There are plenty of other possible causes. Please see
> https://issues.apache.org/bugzilla/show_bug.cgi?id=45601#c3 and follow the
> advice there.

The only advice is this:

"A Google search suggests that a lack of OS resources could also be a cause of
this error. Given that the environment is highly threaded, and taking this to
also mean highly loaded, this looks more like a JVM/OS issue to me."

Resource usage is next to nothing on the servers I've run against. Furthermore, even if this is an OS/JVM issue, will you simply drop all claims of support for Solaris? I've tested against multiple JVM versions, and the fix does not seem to affect the performance of Tomcat. This may only be an issue with Solaris, but if the price of Solaris support is "a negligible performance impact", then why not implement the fix, noting that it only corrects the Solaris environment? Refusing to integrate a fix because things "should work" without it is absurd from a practical, 'my-business-application-really-needs-to-work-or-I'll-lose-money' mindset.

You claim that the fix would exacerbate the issue; it did not. Clearly, the issue is not well enough understood to make such sweeping statements. I do not have the time to spend the next month researching the internals of Tomcat and Solaris to understand why this specific test scenario fails repeatedly. As a pragmatic person, I fail to understand why a clear solution with minimal negative impact would be rejected, when the affected platform is a popular server operating system backed by Sun with a slow release cycle.

I claim that unless someone can prove that the fix is explicitly harmful, with actual test data, the fix should be implemented, as it has been proven explicitly helpful with actual test data.

Is this not a rational argument?
Comment 5 Remy Maucherat 2010-02-17 23:59:55 UTC
What the patch will do is that it disables the content cache, which is used for small files. As a result, only big files will be accessed, and apparently your great/mature/stable OS does not like that usage pattern.
Comment 6 Mark Thomas 2010-02-18 00:01:04 UTC
https://issues.apache.org/bugzilla/show_bug.cgi?id=45601#c3 sets out the circumstances under which the patch would be applied and points to a resource to help figure out the necessary information. If you choose not to go down that route, that is your choice.

Without an understanding of why the problem occurs and how the patch addresses it there is no certainty that the root cause will be fixed or that other users won't see regressions.

Given the patch does address it for you, there are ways to reduce the burden of running a patched Tomcat version to no more than a couple of minutes of additional work per upgrade. Again, the users list can help with this.
Comment 7 Stefan Kendall 2010-02-18 00:03:42 UTC
(In reply to comment #5)
> What the patch will do is that it disables the content cache, which is used for
> small files. As a result, only big files will be accessed, and apparently your
> great/mature/stable OS does not like that usage pattern.

I have no particular fondness for Solaris, but I'm not a fool. I understand that it's a major server OS, and for it to simply not be supported is completely unacceptable. If the benefits of the fix (removing complete breakage) outweigh the costs of implementing it (a negligible performance loss, solvable with bigger hardware?), then I think the fix should be implemented.

Tell me how my argument is incorrect, please.
Comment 8 Stefan Kendall 2010-02-18 00:21:18 UTC
(In reply to comment #6)
> https://issues.apache.org/bugzilla/show_bug.cgi?id=45601#c3 sets out the
> circumstances under which the patch would be applied and points to a resource
> to help figure out the necessary information. If you choose not to go down that
> route, that is your choice.
> 
> Without an understanding of why the problem occurs and how the patch addresses
> it there is no certainty that the root cause will be fixed or that other users
> won't see regressions.
> 
> Given the patch does address it for you, there are ways to reduce the burden of
> running a patched Tomcat version to no more than a couple of minutes of
> additional work per upgrade. Again, the users list can help with this.

Mark, I will address your concerns in turn.

1.) This is clearly an issue *with Tomcat*. Should everyone who chooses to serve a large file through Tomcat lose months of productivity to obscure, hard-to-trace bugs that have been documented and reproduced? As network connections and bandwidth continually improve, businesses are increasingly likely to send down bigger, richer files (think larger images, SWFs, etc.). As such, this is clearly an issue if Tomcat expects to remain a viable servlet container. Tomcat solves a number of problems for a number of people, but if it cannot support the growing trends in web development (more data, richer data), then surely it will fail. Failing to address this issue not only hurts the product as a whole, it sets a dangerous precedent for all future bug finds.

2.) Would you like performance metrics from a week of constant load, or real-world usage data to confirm my claims? I can provide all of that, but as I am not a Tomcat developer, I cannot give you a mathematical proof of why the fix works. The original poster's ideas were discarded, so I cannot say exactly why the fix works. I do know, however, that for two built deployments of Tomcat 6.0.16 through the latest release, one with the fix and one without, I can reliably reproduce the problem on the unpatched version with load-generation tools (and wget), and I cannot reproduce the problem on the patched version (even with unrealistically high load).

3.) So then, you agree that there's a simple solution to the root cause of my problem, and it's a small change to the source base? As an application developer, I cannot suddenly become responsible for the entirety of the Tomcat code base. If you extend this scenario, any small problem *you* or *I* can not understand thoroughly should be self-maintained, correct? As problems increase, the reliability of the main Tomcat distribution would then become questionable. 

Conclusion:

I submit that unless the problem can be proven harmful (via whatever automated test systems currently exist, or real life usage data), then the patch should be implemented. If no such tests exist, or no one can prove the patch harmful to the stability of Tomcat, then why not implement? There is a strong business case to implement the patch, and there is a weak case against implementation. 

If this does cause regressions for any party, then they can log an issue as I did. Remember that the optimization has not always existed, so at some point, the implementation of the optimization DID cause regressions, but you just didn't see it. This optimization removal would be an undoing of a -problem- introduced into the tomcat source in 5.5, rather than an out-of-nowhere fix that may cause issues. Regressions weren't a concern in 5.5, and they shouldn't be a concern now, when real data is involved to prove the fix viable and the previous optimization unstable.
Comment 9 Remy Maucherat 2010-02-18 00:34:58 UTC
Use cacheObjectMaxSize="0" on the Context element and stop wasting your time posting nonsense ...
Comment 10 Stefan Kendall 2010-02-18 17:01:18 UTC
(In reply to comment #9)
> Use cacheObjectMaxSize="0" on the Context element and stop wasting your time
> posting nonsense ...

Remy, your comment does not make sense.

As per the documentation, files over 512KB are not cached (the default setting). My file is 1.8MB; I stated this in my initial bug report.

Docs:
http://tomcat.apache.org/tomcat-6.0-doc/config/context.html

Furthermore, I even tried this supposed fix (although it made no sense given the documentation), and it did not work. The only fix for this issue, still, is to remove the "optimization" introduced in Tomcat 5.5.
Comment 11 Stefan Kendall 2010-02-18 20:02:19 UTC
gfuser9999@yahoo.co.uk has presented a reasoned explanation of what is happening below. Apparently, the fix needs to be backported from Glassfish.

I've copied the e-mail below:

Now, really the REAL bug fix you need to port back to Tomcat from Glassfish is actually this:

http://fisheye5.cenqua.com/changelog/glassfish?cs=SJSAS91_FCS_BRANCH%3Asultal%3A20080904082511

This is the actual root cause of 45601. It was never recorded in the Tomcat bug entry back in late 2008, but this is the RCA, and it is fixed in the Sun Glassfish application server (which shares the same common Tomcat code).

--------------------------
Here's the salient point:
--------------------------
0. As all the submitters report, the issue happens in DefaultServlet.copyRange.

   The code does something like this:

     InputStream resourceInputStream
         = cacheEntry.resource.streamContent();

1. So it takes a FileResource/Resource from a cache entry (which is looked up from some cache).

2. We know that this is a FileResource (a static resource).

3. Most of the code then does something like this:

     reader = new InputStreamReader(resourceInputStream, fileEncoding);
     copyRange(reader, ...);   <-- do something
     reader.close();

4. Notice that reader.close() will CLOSE the underlying FileInputStream when it runs.

5. Hence, there is a SUBTLE race in FileResource.streamContent where TWO or more threads get the SAME InputStream. This happens because multiple requests can get hold of the same cached entry, and because FileResource.streamContent stashes the InputStream in a temporary instance variable; given the right thread scheduling, two or more threads end up with the same InputStream. [Please see the Tomcat code to see why this can happen.]

Testing?
------------
It is easy to reproduce and confirm this artificially. Slow things down in FileResource.streamContent() by adding delays (or print the resulting InputStream in the copy method) while doing two or more parallel wgets for the same FileResource of some large size in MB.
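
To make the race concrete, here is a minimal, self-contained sketch of the pattern described above. The class and field names are hypothetical and this is not Tomcat code, but the failure mode is the same: streamContent() publishes the stream through a shared instance field, so two request threads can end up holding the same FileInputStream, and the first thread's close() breaks the other thread's read.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for the resource class: each call creates a fresh
// stream but parks it in a shared instance field before returning it.
class SharedStreamResource {
    private final File file;
    private InputStream inputStream;      // shared temp field: the problem

    SharedStreamResource(File file) {
        this.file = file;
    }

    public InputStream streamContent() throws IOException {
        inputStream = new FileInputStream(file);
        Thread.yield();                   // widen the race window for the demo
        return inputStream;               // may already be another thread's stream
    }
}

public class StreamRaceDemo {
    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("large", ".bin");
        FileOutputStream out = new FileOutputStream(file);
        out.write(new byte[2 * 1024 * 1024]);   // ~2MB stands in for the SWF
        out.close();

        final SharedStreamResource resource = new SharedStreamResource(file);
        Runnable request = new Runnable() {
            public void run() {
                try {
                    InputStream in = resource.streamContent();
                    byte[] buf = new byte[8192];
                    while (in.read(buf) != -1) {
                        Thread.sleep(1);  // slow reader, like copyRange under load
                    }
                    in.close();
                } catch (Exception e) {
                    // If both threads got the same stream, the first close() makes
                    // the slower thread's read() fail, analogous to the
                    // "Bad file number" IOException seen on Solaris.
                    System.err.println("request failed: " + e);
                }
            }
        };

        Thread t1 = new Thread(request);
        Thread t2 = new Thread(request);
        t1.start(); t2.start();
        t1.join(); t2.join();
        file.delete();
    }
}

In the real server the race window is opened by many concurrent requests for the same large file rather than by the artificial yield and sleep used here.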
Comment 12 Mark Thomas 2010-02-18 20:26:25 UTC
Bingo. That looks like the sort of analysis that is needed to move this forward. Re-opening to take a look at this potential race condition.

Note that we can't just copy the Glassfish patch without jumping through some legal licensing hoops that I'd prefer to avoid.
Comment 13 Mark Thomas 2010-03-09 13:40:52 UTC
Based on the analysis in comment 11, I can see a couple of places where there is a problem. These have been fixed in trunk and the patch proposed for 6.0.x and 5.5.x (http://svn.apache.org/viewvc?rev=920858&view=rev)

Stefan: Are you able to confirm that using this patch rather than the one in bug 45601 still fixes the issue for you? I remain concerned that you may be seeing a different issue as I can't see a reason, apart from just altering the execution timing, why the bug 45601 patch should fix / work around the issue described in comment 11.
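
For illustration only (this is not the actual r920858 change, and the class name is hypothetical): the general shape of a fix for the race described in comment 11 is to hand every caller its own stream instead of publishing one through a shared instance field.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: nothing is stored in a shared field, so closing one
// request's stream can no longer invalidate another request's read.
class PerCallStreamResource {
    private final File file;

    PerCallStreamResource(File file) {
        this.file = file;
    }

    public InputStream streamContent() throws IOException {
        return new FileInputStream(file);   // a fresh stream per caller
    }
}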
Comment 14 Stefan Kendall 2010-03-09 15:12:46 UTC
(In reply to comment #13)
> Based on the analysis in comment 11, I can see a couple of places where there
> is a problem. These have been fixed in trunk and the patch proposed for 6.0.x
> and 5.5.x (http://svn.apache.org/viewvc?rev=920858&view=rev)
> 
> Stefan: Are you able to confirm that using this patch rather than the one in
> bug 45601 still fixes the issue for you? I remain concerned that you may be
> seeing a different issue as I can't see a reason, apart from just altering the
> execution timing, why the bug 45601 patch should fix / work around the issue
> described in comment 11.

The patch seemed to make the issue less frequent, but further testing indicated that the problem still occurred with the patch from bug 45601. This may have been down to the timing of our particular system; changing the code willy-nilly would alter the frequency at which we saw the issue. It's good to hear that a *true* fix has been proposed for 5.5.x/6.0.x.
Comment 15 Mark Thomas 2010-03-09 15:18:42 UTC
Thanks for that feedback on the patch in bug 45601. That does make it more likely the issue you are seeing is fixed by the patch in r920858 but you obviously need to test that yourself to be 100% sure.
Comment 16 Konstantin Kolinko 2010-03-10 22:45:45 UTC
The patch (r920858) was applied to Tomcat 5.5, will be in 5.5.29 onwards.
Comment 17 Stefan Kendall 2010-03-25 18:01:03 UTC
(In reply to comment #16)
> The patch (r920858) was applied to Tomcat 5.5, will be in 5.5.29 onwards.

r920858 definitively solves our problem. Is there an estimate for when this will be available in a stable release of 6.0.x?
Comment 18 Konstantin Kolinko 2010-03-25 18:27:40 UTC
(In reply to comment #17)
It is already proposed for 6.0 (see the STATUS file in svn). All that is needed is 1) third vote from some committer so that it can be applied, 2) wait for the release to happen, 3) test the release to ensure that it is stable.
Comment 19 Stefan Kendall 2010-03-25 18:33:34 UTC
(In reply to comment #18)
> (In reply to comment #17)
> It is already proposed for 6.0 (see the STATUS file in svn). All that is needed
> is 1) third vote from some committer so that it can be applied, 2) wait for the
> release to happen, 3) test the release to ensure that it is stable.

Good to know. Thanks!
Comment 20 Mark Thomas 2010-04-11 09:32:24 UTC
This has been fixed in 6.0.x and will be included in 6.0.27 onwards.
Comment 21 Matthew Mullins 2010-06-02 15:48:37 UTC
(In reply to comment #20)
> This has been fixed in 6.0.x and will be included in 6.0.27 onwards.

Is there an ETA on the release of 6.0.27?