Bug 63235 - Defer potentially expensive Charset.availableCharsets() call
Summary: Defer potentially expensive Charset.availableCharsets() call
Status: RESOLVED FIXED
Alias: None
Product: Tomcat 9
Classification: Unclassified
Component: Catalina
Version: 9.0.16
Hardware: PC Mac OS X 10.1
Importance: P2 enhancement
Target Milestone: -----
Assignee: Tomcat Developers Mailing List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-06 00:49 UTC by Phillip Webb
Modified: 2019-03-14 13:33 UTC
CC List: 1 user
Description Phillip Webb 2019-03-06 00:49:22 UTC
The `org.apache.tomcat.util.buf.B2CConverter` class includes a static `encodingToCharsetCache` map populated in a static initializer by calling `Charset.availableCharsets()`. 

This is quite an expensive operation and it may be unnecessary as calls to `B2CConverter.getCharset` often use the standard `UTF-8` charset. Some local testing shows that if we don't create the cache then we save 157 classes from being loaded and improve startup time.
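The cost being described is easy to observe directly. A minimal timing sketch (not from the report; numbers will vary by JVM and machine):

```java
import java.nio.charset.Charset;

public class CharsetInitCost {
    public static void main(String[] args) {
        long start = System.nanoTime();
        // This call forces every installed CharsetProvider to enumerate and
        // instantiate all of its charsets, loading their classes as a side effect.
        int count = Charset.availableCharsets().size();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(count + " charsets enumerated in " + elapsedMs + " ms");
    }
}
```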
Comment 1 Phillip Webb 2019-03-06 00:56:10 UTC
Pull request for consideration: https://github.com/apache/tomcat/pull/142
Comment 2 Mark Thomas 2019-03-06 08:56:10 UTC
Lazy loading simply defers the performance hit so a user experiences a delay during a request rather than it happening at startup. The balance of opinion in the past has been that it is better to take such a hit on startup rather than during a request.

This has been discussed previously. For the history see these bugs and the bugs linked from them:
- bug 57808
- bug 58859

If we take some ideas from each of the solutions proposed at various points, I think it might be possible to come up with something that addresses all of the concerns:
- retain a cache of charset name to Charset
- populate the keys only from a list of known charsets generated off-line by
  looking at the various JREs (OpenJDK, Oracle, IBM to start with) at system start
- add a unit test to monitor for new Charsets in new JRE versions
- populate the cache values at start-up for 'popular' charsets (ISO-8859-1, UTF-8
  plus any others we think are likely to be used)
- lazily populate any cache values for other Charsets
- if a charset is requested, the key is present but the JVM can't create it, remove the
  key from the cache to speed up any future requests (saves trying to load non-existent
  Charset again)

As I typed the above I thought of perhaps a simpler variation:
- retain a cache of charset name to Charset
- populate the cache values at start-up for 'popular' charsets (ISO-8859-1, UTF-8
  plus any others we think are likely to be used)
- lazily populate any cache values for other Charsets
- if a requested Charset does not exist, insert a special "NoSuchCharset" object to speed
  up future requests
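That second variation could be sketched roughly as follows (class, field, and sentinel names here are illustrative, not taken from any actual patch):

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class LazyCharsetCache {

    // Sentinel cached for names the JVM cannot resolve, so a repeated
    // bad lookup is answered from the map instead of Charset.forName().
    private static final Charset NO_SUCH_CHARSET =
            new Charset("x-no-such-charset", null) {
                @Override public boolean contains(Charset cs) { return false; }
                @Override public CharsetDecoder newDecoder() { throw new UnsupportedOperationException(); }
                @Override public CharsetEncoder newEncoder() { throw new UnsupportedOperationException(); }
            };

    private static final ConcurrentMap<String, Charset> CACHE = new ConcurrentHashMap<>();

    static {
        // Populate only the 'popular' charsets up front; everything else
        // is resolved lazily on first request.
        for (Charset cs : new Charset[] { StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1 }) {
            CACHE.put(cs.name().toLowerCase(Locale.ENGLISH), cs);
        }
    }

    public static Charset getCharset(String enc) throws UnsupportedEncodingException {
        Charset cs = CACHE.computeIfAbsent(enc.toLowerCase(Locale.ENGLISH), k -> {
            try {
                return Charset.forName(enc);
            } catch (Exception e) {
                return NO_SUCH_CHARSET;
            }
        });
        if (cs == NO_SUCH_CHARSET) {
            throw new UnsupportedEncodingException(enc);
        }
        return cs;
    }
}
```

Note that in this sketch the map grows without bound for attacker-chosen names, which is exactly the weakness raised against this option later in the thread.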


The more I think about it, the more I like that second option.

Thoughts?
Comment 3 Remy Maucherat 2019-03-06 09:07:32 UTC
I think this adds complexity and possible problems for no real benefit. For example option 2 could have a very big cache (all you need to do is request random charset values) and could need a lot more sync than a static cache (which is basically free).
Comment 4 Mark Thomas 2019-03-06 09:24:49 UTC
Of course. That makes option 2 a non-starter.

The other option isn't that far from the current code. I think it is worth looking at what it does to memory footprint and start-up time so we can see if the added complexity is worth the benefit.
Comment 5 Phillip Webb 2019-03-06 19:05:07 UTC
@Mark Thanks for pointing me at those previous issues, I should have thought to search first!

It seems like I may have misunderstood the reason for the cache existing in the first place. My original assumption was that it was necessary for case-insensitive name lookups. It appears that, on OpenJDK at least, calling `Charset.forName` works with any case combination. The following will run without error:

	Set<String> names = new LinkedHashSet<>();
	Charset.availableCharsets().forEach((key, value) -> {
		names.add(value.name());
		names.addAll(value.aliases());
	});
	names.forEach(name -> {
		System.out.println(Charset.forName(name.toLowerCase()));
	});

So the main reasons for the cache, as I understand them, are:

1) Improve lookup performance for initial requests by having a hot cache
2) Prevent DoS attacks by limiting the impact of a user requesting an invalid charset

Looking at each of these in turn:


1) Improve lookup performance for initial requests by having a hot cache

> populate the cache values at start-up for 'popular' charsets (ISO-8859-1, UTF-8 plus any others we think are likely to be used)

I like this suggestion and I think this would be a good match for most users. Popular charsets would be fast, and initial requests to others would take a slight hit. The `Charset` class itself includes a cache, so this could be as simple as calling `Charset.forName` a few times.
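Assuming the JDK's internal cache behaves as described, warming it really could be a handful of `Charset.forName` calls. A minimal sketch (the class name and charset list are illustrative):

```java
import java.nio.charset.Charset;

public class CharsetWarmup {
    // Charsets we expect to see in practice; looking each one up once primes
    // the JDK's internal Charset cache without enumerating all providers.
    private static final String[] POPULAR = { "UTF-8", "ISO-8859-1", "US-ASCII" };

    static void warm() {
        for (String name : POPULAR) {
            Charset.forName(name); // result is cached inside Charset itself
        }
    }
}
```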


2) Prevent DoS attacks by limiting the impact of a user requesting an invalid charset

Looking at the existing JDK `Charset` and `AbstractCharsetProvider`[1] class, it appears that asking for charsets that don't exist does not have an impact on memory. When I run the following code, I don't see a large change:

	for (int i = 0; i < 10000000; i++) {
		try {
			Charset.forName("missing-" + i);
		}
		catch (Exception ex) {
		}
		if (i % 10000 == 0) {
			System.out.println(Runtime.getRuntime().totalMemory());
			System.out.println(Runtime.getRuntime().maxMemory());
			System.out.println(Runtime.getRuntime().freeMemory());
			System.out.println("--");
		}
	}

Of course, this might be JVM specific, so it may still be worth protecting against. One simple option might be to count lookup misses and, after a certain threshold, switch back to a set limited by availableCharsets. Something like:


    public static Charset getCharset(String enc) throws UnsupportedEncodingException {
        Map<String, Charset> cache = encodingToCharsetCache;
        if (cache != null) {
            // The cache is only populated after too many misses; from then on
            // it acts as an allowlist of known encodings.
            Charset charset = cache.get(enc.toLowerCase(Locale.ENGLISH));
            if (charset == null) {
                throw new UnsupportedEncodingException(
                        sm.getString("b2cConverter.unknownEncoding", enc));
            }
            return charset;
        }
        try {
            return Charset.forName(enc);
        } catch (Exception ex) {
            // Count misses; beyond the threshold, fall back to the full
            // availableCharsets() based cache to bound the cost of bad names.
            int count = charsetMissCount.incrementAndGet();
            if (count > THRESHOLD) {
                encodingToCharsetCache = populateEncodingCache();
            }
            throw ex;
        }
    }


I'm happy to rework the PR if that's a sensible suggestion? Even if it were opt-in, I think it would be very nice not to have to pay the startup cost that the existing code incurs, especially if you use an embedded Tomcat and restart frequently during application development.

[1] http://cr.openjdk.java.net/~sherman/6898310/webrev/src/share/classes/sun/nio/cs/AbstractCharsetProvider.java.html
Comment 6 Remy Maucherat 2019-03-06 19:32:47 UTC
I don't see the benefit as there's a clear downside and the complexity skyrockets. IMO you should focus on the String optimization since it doesn't have these issues.

On my desktop computer, I measured the "all charset" loading code using System.nanoTime as taking 31105409ns, so that would be 31 ms. IMO, even as an opt-in and when using something where it is harmless like an IDE, I think users would be best served not wasting their time configuring such an option. On a server, I think these are CPU cycles reasonably well spent given the as-good-as-possible runtime performance and robustness.
Comment 7 Phillip Webb 2019-03-06 20:02:31 UTC
> I don't see the benefit as there's a clear downside and the 
> complexity skyrockets. IMO you should focus on the String 
> optimization since it doesn't have these issues.

I'll continue with the String optimizations as well, but they only help with memory and don't give any startup performance improvements.

> On my desktop computer, I measured the "all charset" loading code using
> System.nanoTime as taking 31105409ns, so that would be 31 ms.

For me, saving 31 ms on each restart is a very worthwhile gain and something I'd love to be able to opt in to. I even think that when running on the server this kind of optimization is something I would find useful in many situations. For example, if I have many internal services that I know will all be hit with only UTF-8, why wouldn't I want to enable the option?

I'd love to be able to find a solution that helps this use-case without adding too much complexity. From the comments on #57808 and #58859 it doesn't appear that I'm alone.
Comment 8 Mark Thomas 2019-03-06 21:15:33 UTC
Just to note that the performance issue with unknown charsets wasn't memory related. It was lock contention. Bug 51400 has the details.
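As I read bug 51400, the contention came from misses always entering a synchronized lookup inside the JDK's charset provider, so a local negative cache of failed names would keep repeated bad lookups off that path. A rough sketch (names are illustrative, and the set here is unbounded, so a real version would need a cap):

```java
import java.nio.charset.Charset;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CharsetLookup {
    // Names that already failed resolution; checked before touching
    // Charset.forName() so repeated misses never re-enter the provider.
    private static final Set<String> KNOWN_BAD = ConcurrentHashMap.newKeySet();

    static Charset lookup(String name) {
        if (KNOWN_BAD.contains(name)) {
            return null;
        }
        try {
            return Charset.forName(name);
        } catch (Exception e) {
            KNOWN_BAD.add(name); // unbounded in this sketch
            return null;
        }
    }
}
```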

Can you provide some context around why 31ms is significant for the use case you are considering? It doesn't seem significant for the typical use cases we see.

I do plan to look at the first option I described in comment #2. There are a few things on my TODO list so if you wanted to re-work your PR in that direction it would certainly help progress things.

Another thing worth doing is testing with a more recent JVM to see if the original issue still exists. If it doesn't, we have a very simple solution available - remove the cache entirely.
Comment 9 Phillip Webb 2019-03-06 21:31:48 UTC
> Can you provide some context around why 31ms is significant for the use 
> case you are considering? It doesn't seem significant for the typical use cases we see.

Spring Boot uses embedded Tomcat as its default servlet container, and startup time has been quite a focus in recent months. As much as possible we want developers to be able to restart their applications quickly during development. We're also seeing a lot of interest in using Spring Boot in FaaS environments. These are quite different from traditional deployments, as applications are only started on demand and are closed very quickly.


> I do plan to look at the first option I described in comment #2. There are a few things 
> on my TODO list so if you wanted to re-work your PR in that direction it would certainly 
> help progress things.

I'll take another look. I made one change but I was missing the context of #51400. Looking at the
Java 8 source code [1], I'd say that's still an issue.


[1] https://github.com/openjdk-mirror/jdk/blame/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/sun/nio/cs/FastCharsetProvider.java#L131-L135
Comment 10 Phillip Webb 2019-03-06 23:28:29 UTC
@Mark

I've updated the PR based on your comments. There is now a cache for concurrent access as well as protection against DoS attacks. I've also tried to add some tests. The list of common charsets and the DoS threshold will no doubt need refining.

If you want me to update it again with an option so that this is opt-in rather than always on please let me know (and point me to an example of how you do this kind of thing).
Comment 11 Mark Thomas 2019-03-08 20:52:04 UTC
I've put together an alternative proposal:
https://github.com/apache/tomcat/pull/146
Comment 12 Mark Thomas 2019-03-14 13:33:03 UTC
Fixed in:
- master for 9.0.18 onwards
- 8.5.x for 8.5.40 onwards
- 7.0.x for 7.0.94 onwards