Created attachment 38363 [details] Patch with one-line fix and relevant unit test On applications with large classpaths, enabling UseBloomFilterForArchives tremendously accelerates classpath scans by performing early exits on each jar, rather than accessing the manifest. Unfortunately the underlying Bloom filters can be cleaned at runtime by AbstractArchiveResourceSet.gc(), which forces the next classpath scan to reinitialize the entire set of bloom filters. On our large application we are therefore observing both huge improvements to startup, and huge cpu regressions while handling traffic. I've attached a patch containing a one-line fix, as well as a unit test that detects the error.
This patch essentially trades memory for performance. We have some users that won't want to make that trade - even if the memory concerned is relatively small. Given the competing demands here, I think we need to find a way to make this configurable. Whether that is by expanding useBloomFilterForArchives beyond a simple boolean, by tweaking the background processing frequency or by a new configuration option is TBD. I want to mull this over a little bit before making any changes. Thoughts?
I had a hand in the original implementation, and the intent was always to create the Bloom filters once and keep them around forever. In our internal implementation (Tomcat 8) we did exactly that. Relative to the intent, this is a bug. However, you have a valid point that the current state may be acceptable for some users, and changing it could cause challenges. FWIW the memory footprint is quite small - we saw ~1.5MB of heap to accommodate ~4000 jars. I'll defer to your judgment. Adding another configuration parameter is certainly doable.
(In reply to Mark Thomas from comment #1) > This patch essentially trades memory for performance. We have some users > that won't want to make that trade - even if the memory concerned is > relatively small. > > Given the competing demands here, I think we need to find a way to make this > configurable. Whether that is by expanding useBloomFilterForArchives beyond > a simple boolean, by tweaking the background processing frequency or by a > new configuration option is TBD. I want to mull this over a little bit > before making any changes. > > Thoughts? +1 true = use the bloom filter as exists today false = do not use the bloom filter wankel = use the bloom filter with cache-purges I think it makes sense to re-name/alias the current setting to something more clear such as "archiveIndexing" with values like "simple", "bloom" and now "purged-bloom" or whatever.
> I think it makes sense to re-name/alias the current setting to something > more clear such as "archiveIndexing" with values like "simple", "bloom" and > now "purged-bloom" or whatever. I like this suggestion.
As the set of entries in Jar is known in advance, and the entries are not modified, Xor Filter might be better. See http://web.stanford.edu/class/archive/cs/cs166/cs166.1216/lectures/13/Slides13.pdf#page=49 https://github.com/FastFilter/xorfilter
(In reply to Vladimir Sitnikov from comment #5) > As the set of entries in Jar is known in advance, and the entries are not > modified, Xor Filter might be better. > > See > http://web.stanford.edu/class/archive/cs/cs166/cs166.1216/lectures/13/ > Slides13.pdf#page=49 > > https://github.com/FastFilter/xorfilter Thank you for the links! It appears to me that Xor has lower memory usage but increases the cost of a lookup. For our applications, that is the wrong tradeoff, however it may work as another value of "archiveIndexing".
> It appears to me that Xor has lower memory usage but increases the cost of a lookup Would you please clarify why you expect Xor filters to have a slower lookup? See https://arxiv.org/pdf/1912.08258.pdf Table 3. Membership-test benchmark results, 25% find. Timings are in nanosecond per query. Xor filters might be faster than Bloom filters. The key tradeoff is that Bloom filter allow "adding more elements to the set", while Xor filter does not support "add(...)" command.
Sure - looking at the specific code in https://github.com/FastFilter/xorfilter/blob/master/xorfilter.go vs https://github.com/apache/tomcat/blob/9.0.x/java/org/apache/catalina/webresources/JarContents.java, the Bloom filter uses two hashes and two bitwise lookups; the Xor filter uses three hashes and three reduce operations. Two hashes vs three is an implementation detail but the extra work of the reduces, although small, may come down to the number of instructions required. I don't have the expertise to address that point, but at a minimum there are more method calls and any associated overhead. As I reread the code though, I notice that the current JarContents code uses a fixed-length array of bits, and the Xor filter uses a variable length array of uint8. Fixed vs. variable length can be addressed, but bit vs uint8 certainly does. For a given number of buckets, the JarContents implementation will therefore be far smaller in memory. I think? However, this conversation isn't tied to the bug/feature/behavior about memory cleanup. Might be interesting to chase this separately as an enhancement.
Sounds like we have general agreement on updating the configuration switch to contain more than just true/false, so the remaining questions are: 1. Do we indeed agree? 2. Should I do the updated patch, or is one of the committers available to do it? I'm willing. Thanks! John
Thanks for the offer of a patch. Please go ahead.
I'm working with John on the updated patch. I'm going to go ahead with updating the "UseBloomFilterForArchives" setting to something similar to what Christopher mentioned: > I think it makes sense to re-name/alias the current setting to something > more clear such as "archiveIndexing" with values like "simple", "bloom" and > now "purged-bloom" or whatever. Do we know of a similar change that was made in the past?
A related issue: it should probably be on the WebResourceRoot rather than the context. trimSpaces for JSPs did something similar (boolean -> String).
I'm worried that adding a new switch on WebResourceRoot that depends on the old switch on Context (for backwards-compatibility) could confuse the config for applications
(In reply to Rahul Jaisimha from comment #13) > I'm worried that adding a new switch on WebResourceRoot that depends on the > old switch on Context (for backwards-compatibility) could confuse the config > for applications We made similar changes in the past (between 7.0.x and 8.0.x) for allowLinking and friends. If we follow the same pattern, the setting would move to WebResourceRoot for 10.1.x but remain on the Context for 10.0.x and earlier.
Here's what I'll do: For 9.0.x and 10.0.x, I'll add a third option to the existing switch "useBloomFilterForArchives" and support the following options: ["true", "false", "purged"]. This is similar to what was done with trimSpaces: https://github.com/apache/tomcat/commit/b527e70e3089c1e3fe499439ef8838e8a3730f36#diff-ead8d9adcc700eb19d039da0cf41ec2806ac11a669a575129176b0ae1ba52a84 For 10.1.x, I'll remove "useBloomFilterForArchives" from Context and add "archiveIndexing" in WebResourceRoot which supports the following options: ["simple", "bloom", "purged-bloom"]. Is that ok?
(In reply to Rahul Jaisimha from comment #15) > Here's what I'll do: For 9.0.x and 10.0.x, I'll add a third option to the > existing switch "useBloomFilterForArchives" and support the following > options: ["true", "false", "purged"]. This is similar to what was done with > trimSpaces: > > https://github.com/apache/tomcat/commit/ > b527e70e3089c1e3fe499439ef8838e8a3730f36#diff- > ead8d9adcc700eb19d039da0cf41ec2806ac11a669a575129176b0ae1ba52a84 > > For 10.1.x, I'll remove "useBloomFilterForArchives" from Context and add > "archiveIndexing" in WebResourceRoot which supports the following options: > ["simple", "bloom", "purged-bloom"]. > > Is that ok? I apologize, I just want to update my previous comment. The supported options for 9.0.x and 10.0.x for "useBloomFilterForArchives" will be ["true", "false", "retained"]. True: purged bloom filter False: no bloom filter Retained: Updated bloom filter without purging
+1
I want to add an IllegalArgumentException to the setter if the input value is unrecognized for ["true", "false", "retained"]. Assuming this is the right thing to do, how do I get the localized/internationalized exception message to use for this?. I can update LocalStrings.properties but I don't know how to update LocalStrings_es.properties or LocalStrings_de.properties for the new exception message.
You don't need to. Volunteers will complete those translations over time. For languages without a translation, the English version will be used.
We've realized the proposed change (updating useBloomFilterForArchives from boolean to String) is a breaking change, since the setter for this value is exposed publically. The two options to get around this and maintain backward-compatibility are: 1. Mark the boolean Setter as Deprecated and add a String Setter. The Getter would remain a boolean and return whether a bloom filter is to be used (true for "TRUE" and "RETAINED"). A second getter would have to be exposed to understand if useBloomFilterForArchives is "RETAINED". This could also be boolean. 2. Maintain the original boolean "useBloomFilterForArchives" and add a second boolean parameter "retainBloomFilter". The logic for re-initializing the bloom filter would depend on both parameters. "retainBloomFilter" will default to false and will maintain the existing behavior for "useBloomFilterForArchives". The existing behavior will also be retained if "useBloomFilterForArchives" is false. Only if "retainBloomFilter" is true and "useBloomFilterForArchives" is true, then will the bloom filter avoid re-initialization. For 10.1.x we will maintain the same solution, since this can be non-backwards-compatible: > For 10.1.x, I'll remove "useBloomFilterForArchives" from Context and add "archiveIndexing" in WebResourceRoot which supports the following options: ["simple", "bloom", "purged-bloom"].
10.1.x is now stable so the plan needs to account for that too. Given the context how about this: - leave useBloomFilterForArchives as a boolean on the Context - deprecate this for all versions (to be removed in Tomcat 11) - add archiveIndexing as a String attribute on WebResourceRoot
(In reply to Mark Thomas from comment #21) > 10.1.x is now stable so the plan needs to account for that too. > > Given the context how about this: > > - leave useBloomFilterForArchives as a boolean on the Context > - deprecate this for all versions (to be removed in Tomcat 11) > - add archiveIndexing as a String attribute on WebResourceRoot I like this solution. In that case, "useBloomFilterForArchives" will be honored if "archiveIndexing" is "false" (default). If "archiveIndexing" is set (any value other than "false"), then it will override "useBloomFilterForArchives". Agree with this behavior?
Created attachment 38402 [details] Updated patch that makes the bloom filter purging configurable
Created attachment 38403 [details] Updated patch that makes the bloom filter purging configurable - Tomcat9
I've attached the updated patch diffs that contain the configurable "archiveIndexStrategy" on WebResourceRoot as discussed. The diff is slightly different in tomcat9.0.x due to some file name changes between 9.0.x and 10.0.x. 10.0.x and 10.1.x diffs look the same.
Thanks for the patches. I applied a slightly modified version. Fixed in: - 10.1.x for 10.1.2 onwards - 9.0.x for 9.0.69 onwards
I was looking over the changes, and should AbstractSingleArchiveResourceSet also rely upon the same configuration honor the retain and bloom filter setting? It makes for inconsistent behavior if only one of the ArchiveResourceSet classes uses it.
(In reply to gabriel.hollies from comment #28) > I was looking over the changes, and should AbstractSingleArchiveResourceSet > also rely upon the same configuration honor the retain and bloom filter > setting? > > It makes for inconsistent behavior if only one of the ArchiveResourceSet > classes uses it. Why do you think it doesn't?