Bug 58859

Summary: Allow to limit charsets / encodings supported by Tomcat
Product: Tomcat 9
Reporter: Konstantin Kolinko <knst.kolinko>
Component: Catalina
Assignee: Tomcat Developers Mailing List <dev>
Status: NEW ---
Severity: enhancement
CC: msjs.sumudini
Priority: P2
Version: unspecified
Target Milestone: -----
Hardware: PC
OS: All

Description Konstantin Kolinko 2016-01-14 11:17:36 UTC
There was an enhancement request (bug 57808 "Don't preload all charsets").
I want to implement a similar thing, but as a security / paranoid feature.

The issue: A client request can specify an encoding (charset) name. This charset is used to parse request parameters (the query string and parameters in the body of a POST request).

The problem is that a Java Runtime supports many charsets, but I really use only a handful of them (ISO-8859-1, US-ASCII, UTF-8, and several charsets used in my country).

There is a nasty charset called UTF-7 [1], and some old browser was "nice" enough to implement it. Luckily, current versions of Java do not implement it (tested with Sun/Oracle Java 5/6/7/8), but I do not know all of the charsets that are implemented; some of them are exotic and some are experimental (X-*).

Proposal
==========
1. A new system property with the following name:
org.apache.tomcat.SUPPORTED_CHARSETS

2. The following behaviour:
If this property is set to a non-empty string, then the static initialization block of B2CConverter uses the character sets named in this property to populate a Set<Charset>, which is used instead of Charset.availableCharsets() in the initialization loop.

For example, if org.apache.tomcat.SUPPORTED_CHARSETS=ISO-8859-1,UTF-8 then Tomcat will only support those two charsets and all aliases of their names. An attempt to use any other character set name will result in an UnsupportedEncodingException.
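A minimal sketch of how the property could be turned into that set. Only the property name comes from the proposal; the class name SupportedCharsets and the method charsetsToLoad are hypothetical, and failing fast on an unknown name is an assumption (the proposal does not say what should happen in that case). This is not the actual B2CConverter code.

```java
import java.nio.charset.Charset;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

public class SupportedCharsets {
    // Property name as proposed above; everything else here is hypothetical.
    public static final String PROPERTY = "org.apache.tomcat.SUPPORTED_CHARSETS";

    /** Returns the charsets to preload: the configured subset, or all of them if unset. */
    public static Collection<Charset> charsetsToLoad(String propertyValue) {
        if (propertyValue == null || propertyValue.trim().isEmpty()) {
            // Property not set: keep the existing behaviour.
            return Charset.availableCharsets().values();
        }
        Set<Charset> result = new LinkedHashSet<>();
        for (String name : propertyValue.split(",")) {
            name = name.trim();
            if (name.isEmpty()) {
                continue;
            }
            // Assumption: an illegal or unknown name should fail startup, so the
            // exception thrown by Charset.forName() is deliberately not caught.
            result.add(Charset.forName(name));
        }
        return result;
    }

    public static void main(String[] args) {
        Collection<Charset> loaded = charsetsToLoad(System.getProperty(PROPERTY));
        System.out.println("Charsets to preload: " + loaded.size());
    }
}
```

The initialization loop would then iterate over the returned collection, registering each canonical name and its aliases as it does today.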

With Java 8u66, those two charsets give the following allowed names:

ISO-8859-1
819 (alias for ISO-8859-1)
ISO8859-1 (alias for ISO-8859-1)
l1 (alias for ISO-8859-1)
ISO_8859-1:1987 (alias for ISO-8859-1)
ISO_8859-1 (alias for ISO-8859-1)
8859_1 (alias for ISO-8859-1)
iso-ir-100 (alias for ISO-8859-1)
latin1 (alias for ISO-8859-1)
cp819 (alias for ISO-8859-1)
ISO8859_1 (alias for ISO-8859-1)
IBM819 (alias for ISO-8859-1)
ISO_8859_1 (alias for ISO-8859-1)
IBM-819 (alias for ISO-8859-1)
csISOLatin1 (alias for ISO-8859-1)

UTF-8
unicode-1-1-utf-8 (alias for UTF-8)
UTF8 (alias for UTF-8)
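
A listing like the one above can be reproduced with Charset.aliases(). The sketch below is only an illustration (the class name CharsetAliases and the method describe are hypothetical); the set and order of aliases depend on the JRE in use.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class CharsetAliases {
    /** Canonical name first, then one "ALIAS (alias for NAME)" entry per alias. */
    public static List<String> describe(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        List<String> lines = new ArrayList<>();
        lines.add(cs.name());
        for (String alias : cs.aliases()) {
            lines.add(alias + " (alias for " + cs.name() + ")");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ISO-8859-1", "UTF-8"}) {
            for (String line : describe(name)) {
                System.out.println(line);
            }
            System.out.println();
        }
    }
}
```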

This feature applies only to the set of charsets used via the B2CConverter class, which is used internally by Tomcat. I think that Jasper does not use it, so the feature does not apply to the encoding used to write the source code of JSP pages.

The difference from the enhancement proposed in bug 57808 is that charsets not named in the property are not supported at all, instead of being loaded lazily.


SetCharacterEncodingFilter 
============================
I should also note the following:

The issue of a charset name provided by the client can also be solved by using
 org.apache.catalina.filters.SetCharacterEncodingFilter [2]
configured with the initialization parameter ignore="true".

This filter is available in all current Tomcat versions (6/7/8/9). Some web frameworks (e.g. Spring) also provide similar filters.

If a web application renders all its pages in UTF-8, then it can expect all requests sent to it to use UTF-8 as well.


[1] https://en.wikipedia.org/wiki/UTF-7
[2] http://tomcat.apache.org/tomcat-8.0-doc/config/filter.html#Set_Character_Encoding_Filter
Comment 1 Remy Maucherat 2016-01-14 13:36:26 UTC
That's not a bad idea, but is it really practical in production? Also, historically, UTF-8 has caused the most security issues from what I know, and it isn't going to be possible to disable it.

+1 anyway, but only as a system property (as you proposed) since it is too global + specific.
Comment 2 Christopher Schultz 2016-01-14 15:45:02 UTC
What is the issue here? IIRC, Tomcat has a cache of Charsets it will use, so a client specifying a little-used charset will just thrash that cache a bit.
Comment 3 Konstantin Kolinko 2016-01-14 17:20:47 UTC
Chris, the cache evolved into a static preloaded set some time ago (since r1140156); it is not updated at runtime.

The issue here is that the client-provided charset name is used for processing both client-provided data and application-provided data (e.g. the forward() processing code touched by the recent fix to bug 58836).

Application-provided data usually comes with the assumption that the client-provided charset is sane (e.g. a superset of US-ASCII). I am just not sure that this assumption holds for all charsets implemented by a JRE, since I do not know all of them. E.g. current Java 8 implements 170 charsets, some of which have names starting with "x-".
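
One rough way to probe the "superset of US-ASCII" assumption is to check, for each installed charset, whether printable ASCII text encodes to the same bytes as in US-ASCII and decodes back unchanged. The sketch below is a crude heuristic, not a proof of safety: the class and method names are hypothetical, and stateful encodings may still pass even though the same bytes can mean something else after an escape sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSupersetCheck {
    // Tab, CR, LF plus all printable US-ASCII characters (0x20..0x7E).
    private static final String ASCII;
    static {
        StringBuilder sb = new StringBuilder("\t\r\n");
        for (char c = 0x20; c < 0x7F; c++) {
            sb.append(c);
        }
        ASCII = sb.toString();
    }

    /** True if ASCII text round-trips through cs with the same bytes as US-ASCII. */
    public static boolean isAsciiCompatible(Charset cs) {
        if (!cs.canEncode()) {
            return false; // decode-only charsets cannot be tested this way
        }
        byte[] expected = ASCII.getBytes(StandardCharsets.US_ASCII);
        try {
            return Arrays.equals(expected, ASCII.getBytes(cs))
                    && ASCII.equals(new String(expected, cs));
        } catch (RuntimeException e) {
            return false; // treat any misbehaving implementation as incompatible
        }
    }

    public static void main(String[] args) {
        int total = 0;
        int compatible = 0;
        for (Charset cs : Charset.availableCharsets().values()) {
            total++;
            if (isAsciiCompatible(cs)) {
                compatible++;
            } else {
                System.out.println("Not ASCII-compatible: " + cs.name());
            }
        }
        System.out.println(compatible + " of " + total + " charsets passed the check");
    }
}
```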

It is easy to enforce the charset (via SetCharacterEncodingFilter), but that removes the client's ability to specify a charset altogether.

It is possible to implement a similar Filter that checks the provided charset name (probably against some whitelist).
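
A sketch of the check such a Filter could perform, written as a standalone class so that it runs without the Servlet API. The class name CharsetWhitelist and its methods are hypothetical. In a real Filter, doFilter() would presumably pass the value of request.getCharacterEncoding() to isAllowed() and reject the request (or fall back to a default encoding) when it returns false; which of the two is appropriate is left open here.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.HashSet;
import java.util.Set;

public class CharsetWhitelist {
    private final Set<Charset> allowed = new HashSet<>();

    /** Names may be canonical names or aliases; unknown names fail at construction. */
    public CharsetWhitelist(String... names) {
        for (String name : names) {
            allowed.add(Charset.forName(name));
        }
    }

    /**
     * Returns true if the client did not specify a charset (null) or if the
     * specified name, or the charset it is an alias of, is on the whitelist.
     */
    public boolean isAllowed(String requestedName) {
        if (requestedName == null) {
            return true;
        }
        try {
            // Charset.forName() resolves aliases, and Charset equality is
            // based on the canonical name, so "utf8" matches "UTF-8".
            return allowed.contains(Charset.forName(requestedName));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        CharsetWhitelist whitelist = new CharsetWhitelist("ISO-8859-1", "UTF-8");
        for (String name : new String[] {"utf8", "latin1", "UTF-16"}) {
            System.out.println(name + " allowed: " + whitelist.isAllowed(name));
        }
    }
}
```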