Summary:    org.apache.jasper.runtime.BodyContentImpl causing huge memory allocations
Product:    Tomcat 8    Reporter: Brian Remmington <brian.remmington>
Component:  Jasper      Assignee: Tomcat Developers Mailing List <dev>
Proposed changes to BodyContentImpl
Test case to compare default and suggested implementation
Description Brian Remmington 2007-11-21 06:23:58 UTC
BodyContentImpl buffers all output from a custom tag, ready to be written when the tag execution ends. However, the way in which it grows this buffer is extremely inefficient and has two undesirable effects:
- garbage collection is triggered very frequently to tidy up the waste;
- CPU load ramps up as large, unnecessary array copies take place.
All that's needed is a more intelligent buffer-management algorithm. I have rewritten this class and can forward it if that would be useful (can't see a way of attaching it here).
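For illustration, the growth pattern described above can be sketched as follows. This is a hypothetical stand-in, not the actual Tomcat source; the class name, field names, and the 512-char starting size are all invented. It counts how many characters get shuffled around purely because of reallocation:

```java
// Sketch of the doubling-reallocation pattern the report describes:
// every overflow allocates a doubled array, copies the old contents into
// it, and abandons the old array to the garbage collector.
public class DoublingBuffer {
    private char[] cb = new char[512];
    private int len = 0;
    public long copiedChars = 0; // chars copied purely due to reallocation

    public void write(char[] src, int off, int n) {
        while (len + n > cb.length) {
            char[] bigger = new char[cb.length * 2]; // new, larger block
            System.arraycopy(cb, 0, bigger, 0, len); // full copy of old data
            copiedChars += len;
            cb = bigger;                             // old array left for GC
        }
        System.arraycopy(src, off, cb, len, n);
        len += n;
    }

    public static void main(String[] args) {
        DoublingBuffer b = new DoublingBuffer();
        char[] chunk = new char[100];
        for (int i = 0; i < 10000; i++) {
            b.write(chunk, 0, chunk.length);
        }
        // More characters end up being copied between discarded arrays
        // than were ever actually written.
        System.out.println("written=" + (100 * 10000)
                + " copiedByRealloc=" + b.copiedChars);
    }
}
```

Running this shows the reallocation traffic exceeding the payload itself, which is the CPU and garbage cost the report is about.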
Comment 1 Mark Thomas 2007-11-23 12:11:42 UTC
There is an option to attach patches just above the comment box on the Bugzilla page for each bug.
Comment 2 Brian Remmington 2008-02-08 07:33:57 UTC
Created attachment 21500 [details] Proposed changes to BodyContentImpl The attachment "buzilla43925.jar" contains a new implementation of the BodyContentImpl class (named "NewBodyContentImpl") and a new class on which it depends named "CharBuffer". I have also included two JUnit test cases. This new implementation performs 45% faster than the current implementation on my test machine and is substantially more efficient in memory usage, causing no character arrays to be left to the garbage collector unnecessarily. I would greatly appreciate it if people would review this code, and let me have any comments or suggestions.
Comment 3 Remy Maucherat 2008-02-10 18:20:23 UTC
Ok, so I'm not sure I got everything, but:
- I don't see where memory usage is bounded in the code (if it's not, you should compare performance with the unbounded mode of Jasper, which will never reallocate anything); that's a problem.
- The opportunity is the ability to share the buffer list per thread, so CharBuffer.bufList should be a ThreadLocal<LinkedList> (and should be limited in size, which would solve item 1); as a configuration option, it could be a concurrent static structure, for use with thread-intensive connectors.
- CharBuffer.toArray is expensive (no other option, though), so you should try to show a (real) test result.
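The per-thread sharing suggested here could look roughly like the sketch below. This is illustrative only: ChunkPool, its constants, and its method names are invented for this example and are not Tomcat API. Each thread keeps a small, bounded pool of fixed-size char chunks that buffers borrow and return:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of per-thread chunk recycling with an upper bound, so memory
// held back from the GC is limited (the "limited in size" point above).
public class ChunkPool {
    private static final int CHUNK_SIZE = 8192; // invented size
    private static final int MAX_POOLED = 16;   // the size bound

    private static final ThreadLocal<Deque<char[]>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    public static char[] borrow() {
        char[] c = POOL.get().pollFirst();
        return (c != null) ? c : new char[CHUNK_SIZE];
    }

    public static void recycle(char[] chunk) {
        Deque<char[]> pool = POOL.get();
        if (chunk.length == CHUNK_SIZE && pool.size() < MAX_POOLED) {
            pool.addFirst(chunk); // keep it for this thread's next request
        } // otherwise drop it and let the GC reclaim it
    }

    public static void main(String[] args) {
        char[] a = borrow();
        recycle(a);
        // The same thread gets the same backing array back.
        System.out.println("reused=" + (a == borrow()));
    }
}
```

A ThreadLocal avoids synchronization entirely; the configurable concurrent static structure mentioned above would trade that for sharing across threads.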
Comment 4 Brian Remmington 2008-02-12 01:44:29 UTC
Thanks very much, Remy.

> I don't see where memory usage is bounded in the code (if it's not, you should compare performance with the unbounded mode of Jasper, which will never reallocate anything); that's a problem

Perhaps you could help me here. The memory usage doesn't appear to be bounded in the current implementation of BodyContentImpl, which *always* appears to reallocate exponentially larger blocks (in the reAllocBuff method at the bottom). I must be missing something, but am not sure what. I'm looking at the head of 5.5, which is the version affected by this change.

> The opportunity is the ability to share the buffer list per thread, so the CharBuffer.bufList should be ThreadLocal<LinkedList> (and should be limited in size, which should solve item 1); as a configuration option, it could be a concurrent static structure, for use with thread intensive connectors

This would require CharBuffer objects to be pooled to be of use. At the moment they aren't (and neither are BodyContentImpl objects, as far as I can see). If I understand you correctly, this would have the undesirable effect of holding onto blocks of memory once allocated rather than making them available for garbage collection. This would certainly require a bounded memory-pool approach, but it's debatable whether that would be preferable. It would certainly be a bigger change.

> CharBuffer.toArray is expensive (no other option, though), so you should try to show a (real) test result

I will put together a suitable test case for this. Is there a set of test cases for the current implementation? If so, I could just add new ones to it.

Regards
Brian
Comment 5 Remy Maucherat 2008-02-14 05:35:08 UTC
About the upper memory bound, I had missed the clear method. I thought the original purpose was to keep the list of buffers around; now it looks to me like the memory usage is equivalent to the LIMIT_BUFFER mode.
Comment 6 Thomas Jonsson 2008-04-02 23:42:44 UTC
What is the status of this "bug"? This causes problems in our JSF-based application with a couple of hundred users. I tried Brian's fix with no improvement; memory is still occupied and not GC'd. However, I deployed the application in Glassfish, and there the character array does get GC'd. Any ideas what to do?
Comment 7 Tim Funk 2008-04-03 04:07:15 UTC
*** This bug has been marked as a duplicate of bug 37793 ***
Comment 8 Brian Remmington 2008-04-06 00:10:46 UTC
This is not a duplicate of 37793. That one relates to the unbounded growth of cb without the ability for the garbage collector to kick in and collect it. The LIMIT_BUFFER option goes some way towards resolving that problem. Thomas (comment #6) does seem to be suffering from 37793 rather than this one.

*This* problem, however, is about the way in which, while cb grows, large numbers of exponentially larger char arrays are reallocated, copied, and discarded. Every time chars are written that cause cb to grow, the following process occurs:
a. Allocate a new char array that is twice as large as the current "cb";
b. Copy the contents of cb into the new array;
c. Assign the new array to cb;
d. The garbage collector collects the old cb array.

This, I think, is obviously inefficient. It fills up the heap with large, unwanted character arrays and forces the garbage collector to activate more frequently, thus reducing the CPU available for doing useful work. It also uses CPU to copy exponentially growing arrays from one block of memory to another. It is all unnecessary.

My suggested resolution improves the way this buffering is managed. No array copies occur. No unnecessarily large heap allocations occur. Nothing is left to the garbage collector that doesn't have to be. The performance of this part of the code is improved. The drain that this code places on system resources is reduced. Given the widespread use of various tag libraries, I still feel that this is a very worthwhile change.

Regards
Brian
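The chunk-list idea behind the proposal can be sketched roughly as below. This is illustrative only: the attachment's CharBuffer keeps a LinkedList of char arrays, but the class name, chunk size, and method shapes here are invented. Growth simply appends a fresh chunk, so existing data is never copied or discarded, and only toArray() performs a single final copy:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a chunked character buffer: growing adds a new chunk to the
// list instead of reallocating and copying the whole backing array.
public class ChunkedCharBuffer {
    private static final int CHUNK_SIZE = 4096; // invented size
    private final List<char[]> chunks = new ArrayList<>();
    private int lenInLast = CHUNK_SIZE; // forces a first chunk on first write
    private int total = 0;

    public void write(char[] src, int off, int n) {
        while (n > 0) {
            if (lenInLast == CHUNK_SIZE) {   // "grow": just add a chunk
                chunks.add(new char[CHUNK_SIZE]);
                lenInLast = 0;
            }
            int room = Math.min(n, CHUNK_SIZE - lenInLast);
            System.arraycopy(src, off, chunks.get(chunks.size() - 1),
                             lenInLast, room);
            lenInLast += room;
            total += room;
            off += room;
            n -= room;
        }
    }

    public char[] toArray() {                // one copy, at the very end
        char[] out = new char[total];
        int pos = 0;
        for (char[] c : chunks) {
            int take = Math.min(CHUNK_SIZE, total - pos);
            System.arraycopy(c, 0, out, pos, take);
            pos += take;
        }
        return out;
    }

    public static void main(String[] args) {
        ChunkedCharBuffer b = new ChunkedCharBuffer();
        b.write("hello".toCharArray(), 0, 5);
        b.write("world".toCharArray(), 0, 5);
        System.out.println(new String(b.toArray()));
    }
}
```

Compared with the doubling scheme, writes never touch previously written data; the cost is that reads and the final toArray() must walk the chunk list, which is the toArray expense Remy flagged in comment 3.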
Comment 9 Mark Thomas 2011-12-20 20:35:50 UTC
This Tomcat 5 enhancement request has been moved to Tomcat 7 (the latest version) since Tomcat 5 development is limited and focussed on bugs and security issues whereas Tomcat 7 is still seeing new feature development.
Comment 10 Issa Gorissen 2014-06-23 15:29:42 UTC
Hi,

For this issue:
- is there already a chosen architecture for the new version of the buffering, or
- should it be discussed, or
- do you let the numbers speak for any new implementation?

I'm willing to spend time on this for Tomcat 7, so answers from the Tomcat devs are welcome.

Thx
--
Issa
Comment 11 Mark Thomas 2014-06-23 20:16:54 UTC
I don't recall any solutions being discussed, but you should search the archives and review any discussions there may have been about this feature. It is up to you whether to discuss it first. Generally, the more complex and invasive the change, the more useful early discussion is.

A patch for this enhancement will be judged like any other. It isn't just about raw performance; we also take into account the complexity of the resulting code, the invasiveness of the patch, the usefulness of the feature, and so on. Larger changes are best reviewed as a series of smaller patches. See my WebSocket compression changes earlier today as an example of breaking a large patch down into a series of smaller patches to make it easier to review.

Note that you should provide patches against Tomcat trunk; they will then be back-ported to the release branches as appropriate.
Comment 12 Issa Gorissen 2014-06-27 14:48:02 UTC
Brian, I've taken a look at the code you've submitted, but I failed to understand how this code reduces the number of array copies in comparison with Tomcat's current implementation. I've run the test cases with different values for the iteration count and can't see real improvements.

Calls to write(String) will always induce an array copy, unfortunately, because Strings are immutable. Calls to write(char[]) are not improvable because we don't know if the caller will reuse the char array, so we need to copy it. I don't know if calls to write(String) are more frequent in a web app than calls to write(char[]). If that is the case, then trying to avoid the array copy from the String would be good overall. In fact, having to manage both char arrays and Strings makes it harder to improve the whole class!

So as it seems that reducing the number of array copies while calling write() is difficult (even doable?), maybe improving the way the buffer grows is the way to go: reducing the number of array copies and garbage creation. Any comments on my reflections?
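For reference, the point about String writes follows from the standard JDK API alone: a String's backing array is private, so the only way to move its characters into a char buffer is String.getChars, which copies. A minimal demonstration (the class name is invented for this sketch):

```java
// Sketch: extracting a String's characters into a working buffer always
// goes through String.getChars, which performs a copy.
public class StringCopyDemo {
    public static void main(String[] args) {
        String s = "some tag body output";
        char[] buf = new char[s.length()];
        s.getChars(0, s.length(), buf, 0); // unavoidable copy out of the String
        System.out.println(new String(buf));
    }
}
```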
Comment 13 Brian Remmington 2014-06-27 17:44:57 UTC
Hi Issa It's been a while since I raised this (6 years), so I had to refresh my memory. I've just checked out the trunk of Tomcat 8, and the same problem exists, and I still think my code addresses it. Your final sentence describes exactly what my suggested solution does: it improves the way the buffer grows to reduce array copies and reduce garbage collection. Your other comments don't relate to my proposal. In the current Tomcat class (org.apache.jasper.runtime.BodyContentImpl), the problem is with the method reAllocBuff(int) which is right at the bottom of the class. This is called when the buffer array (cb) needs to grow, and it's terribly inefficient. My code replaces cb with an instance (named "buffer") of a new class CharBuffer which manages a LinkedList of char arrays to ensure that, upon growth (CharBuffer.grow), no array copy occurs, and nothing is left to the garbage collector.
Comment 14 Brian Remmington 2014-06-28 10:37:36 UTC
I've just run the tests I provided again against Tomcat 8, ramping the iterations in BodyContentTest steadily up to 1500000 (hardware is *so* much better now...). My version is consistently 40% faster than the current version; however, both are pretty quick these days. My machine can chew through 1000000 iterations in 1378ms with the current version (BodyContentImpl) and 824ms with my version (NewBodyContentImpl).

Speed isn't the main point of this, however. It's the very wasteful use of memory that was causing the real problem when I initially raised this issue. The garbage collector was kicking in constantly, completely draining CPU cycles. For example, take the iterations up to 1500000 on my test machine, and the current version blows up with an OutOfMemoryError. My version continues to work and scales fairly linearly (1341ms). Take it up further to 2000000 iterations, and mine starts to plateau (2393ms), but it still continues to work. At 3000000 iterations, my version completes in 4543ms.
Comment 15 Issa Gorissen 2014-07-01 09:35:09 UTC
I made some changes to the test case you provided to exclude the randomness (so that the random numbers are always the same for both cases) and ran the test for one target only. The JVM is Java HotSpot(TM) 64-Bit Server VM (25.5-b02) for linux-amd64 JRE (1.8.0_05-b13), run with settings -server -Xloggc:gc.log -Xmx4G.

I don't get the same results as yours, and GC runs a lot more on your class than on Tomcat's. (I can send you the results if you'd like.) It's true that I can push the number of iterations higher with your class than with Tomcat's before getting an OOME. But is this test case emulating real usage?

My next tests will be to run both classes under Tomcat with a test case of several pages with tags, driven by JMeter. This will take me some time, so I will not be back for several weeks. Thx
Comment 16 Issa Gorissen 2014-07-22 10:07:26 UTC
Created attachment 31841 [details] Test case to compare default and suggested implementation

I've run some tests with JMeter 2.11 against some small custom tags running under Tomcat 8.0.9. The results from the JMeter test in the attached file show that the default implementation is a little bit faster than the suggested one (the number of JMeter samples being higher for the same amount of running time).

Can someone run the tests I provided and report the results? If any improvement or suggestion is found for the tests I provided, please report that as well. Thx!
Comment 17 Mark Thomas 2018-01-11 20:09:29 UTC
This is being addressed under the following pull request: https://github.com/apache/tomcat70/pull/4 I'm therefore marking this as resolved as I can't mark it as a duplicate of a PR.