Bug 57808

Summary:	Don't preload all charsets
Product:	Tomcat 8	Reporter:	Philippe Marschall <kustos>
Component:	Util	Assignee:	Tomcat Developers Mailing List <dev>
Status:	RESOLVED WONTFIX
Severity:	enhancement	CC:	fredrik
Priority:	P2	Keywords:	PatchAvailable
Version:	8.0.x-trunk
Target Milestone:	----
Hardware:	PC
OS:	Mac OS X 10.1
Attachments:	simple patch 2nd try at a patch

Description Philippe Marschall 2015-04-13 16:59:51 UTC

Created attachment 32646 [details]
simple patch

I analyzed several heap dumps and and for small applications it seems that Tomcat uses about twice as much memory as Jetty (roughly 20MB vs 10MB). Most of this seems to come from Charset instances. While the number of Charsets is low some of them (mostly EBCDIC and GB18030) have quite large coding tables.

The attached patch only preloads UTF-8 and ISO-8859-1, this saves about 5MB. That doesn't sound drastic but makes Tomcat to go from about 17.5MB to 12.5MB.

Due to the discussion in 51400 I decided not to cache look up failures.

I decided to use a ConcurrentHashMap so that readers never block. In theory a fully synchronized map could also be used as look up times should be low and therefore the lock should be uncontended most of the time.

See the attached heap dumps for applications before and after the patch.

Comment 1 Philippe Marschall 2015-04-13 17:04:29 UTC

The heap dumps would be around 7MB compressed so I can't attach them. Where should I post them?

Comment 2 Remy Maucherat 2015-04-13 17:10:49 UTC

In most cases, 5MB would be a trivial amount of memory, so how about using a flag with a system property for that (or something similar) ?

Comment 3 Konstantin Kolinko 2015-04-14 00:47:03 UTC

This preloading is necessary to avoid performance bottlenecks. (See history of that code, or search Bugzilla  -> bug 51400).

As such, it cannot be removed.

If one wants, it shall be possible to provide a user-configurable list of charsets as an opt-in feature (the same as suggested in comment 2), but Tomcat will behave as all the other charsets do not exist.

Comment 4 Philippe Marschall 2015-04-14 05:44:03 UTC

I have read bug 51400 and my understanding is that preloading is not necessary to avoid performance bottlenecks. Preloading was discussed but I couldn't find the reasons why preloading was ultimately chosen. The reasons for other implementation decisions I could find (not caching misses, not storing permutations, …).
My understanding is that avoiding calling (indirectly) Charset.forName() / Charset.lookup() during normal request processing is necessary to avoid performance bottlenecks. This is achieved by the caching even when avoiding the preloading.

Comment 5 Philippe Marschall 2015-04-19 18:55:15 UTC

Created attachment 32662 [details]
2nd try at a patch

Updated patch:
- keeps the current behavior as default
- new behavior is available and configurable through a system property
- document the system property

Comment 6 Mark Thomas 2015-04-20 20:04:57 UTC

(In reply to Philippe Marschall from comment #4)
> I have read bug 51400 and my understanding is that preloading is not
> necessary to avoid performance bottlenecks. Preloading was discussed but I
> couldn't find the reasons why preloading was ultimately chosen.

It means you don't have to cache the misses since misses have their own DoS potential.

I wonder if lazy-loading the cache with ISO-8859-1 and UTF-8 pre-loaded would give a sensible balance without adding another configuration option?

I'd feel more comfortable in these discussions with some numbers to quantify the performance impact.

Comment 7 Fredrik Jonson 2015-04-21 10:53:12 UTC

(In reply to Mark Thomas from comment #6)

> It means you don't have to cache the misses since misses have their own DoS
> potential.

Is it correct that there are two possible DOS attacks with dynamic charset loading, i.e with the patch applied:

 1. force permanent increased memory during runtime with requests that specify
    all avaliable but previously unloaded charsets.

 2. force expensive charset name/instance lookup misses with requests
    that specify non-existent charset names.

Any other?

The consequence of threat 1 is limited. It is self-exhausting as soon as all available charsets have been loaded once. One mitigation could be to use weak references, if it's worth it?

Threat 2 could be mitigated if charset loading is performed only when the charset name is valid. Preloading all available charset names on startup will block threat 2, right? Assuming of course that prefetching all charset names doesn't nullify the memory reduction.

BTW, assuming the proposal is accepted, I'd like to see dynamic loading as the new default, and if necessary an option to make tomcat fall back to the old pre-loading behaviour. Correct and lean should be the default, not an option.

(N.B I'm not a tomcat committer, just a interested user.)

Comment 8 Mark Thomas 2015-04-21 11:08:59 UTC

(In reply to Fredrik Jonson from comment #7)
> (In reply to Mark Thomas from comment #6)
> 
> > It means you don't have to cache the misses since misses have their own DoS
> > potential.
> 
> Is it correct that there are two possible DOS attacks with dynamic charset
> loading, i.e with the patch applied:
> 
>  1. force permanent increased memory during runtime with requests that
> specify
>     all avaliable but previously unloaded charsets.
> 
>  2. force expensive charset name/instance lookup misses with requests
>     that specify non-existent charset names.

Just to nit pick, the expensive lookups are (at least were when this code was written) for any Charset (valid or invalid) that are not in the cache.

> Any other?
> 
> The consequence of threat 1 is limited. It is self-exhausting as soon as all
> available charsets have been loaded once.

Agreed.

> One mitigation could be to use weak references, if it's worth it?

I don't think so.

> Threat 2 could be mitigated if charset loading is performed only when the
> charset name is valid. Preloading all available charset names on startup
> will block threat 2, right?

Correct.

> Assuming of course that prefetching all charset
> names doesn't nullify the memory reduction.

I think it does. At keast the way we get the names of the valid charsets currently it does,

> BTW, assuming the proposal is accepted, I'd like to see dynamic loading as
> the new default, and if necessary an option to make tomcat fall back to the
> old pre-loading behaviour. Correct and lean should be the default, not an
> option.

No objections there.

> (N.B I'm not a tomcat committer, just a interested user.)

A few good patches can soon change that ;)

The discussion in bug 51400 suggests that the performance characteristics we were trying to avoid have changed in Java 7 (i.e. Tomcat 8 onwards). That makes it more important we have some good test results on which to base any decision.

Comment 9 Mark Thomas 2015-05-19 13:48:42 UTC

I've been doing some further investigations and I suspect that the results are highly dependent on the JVM version used.

With recent Java 8 releases the memory retained by this cache is relatively low (around 100k) and while you can reduce that by not preloading, the gains you make are lost by what you have to add to account for lazy-loading while retaining the performance benefits.

Given that there is not a compelling case for changing the current behaviour, I am resolving this as WONTFIX.

Comment 10 Konstantin Kolinko 2016-01-14 11:21:53 UTC

FYI:
A new enhancement request - bug 58859 "Allow to limit charsets / encodings supported by Tomcat" - offers a similar feature, but with different requirements and background.