Bug 61197 - Breaking change in Content-Type / Character Encoding handling
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: Catalina
Version: 8.5.15
Hardware: All (OS: All)
Importance: P2 regression
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
Depends on:
Reported: 2017-06-19 00:19 UTC by Matthew Shaw
Modified: 2017-06-19 18:41 UTC
Description Matthew Shaw 2017-06-19 00:19:34 UTC
I *believe* this constitutes a regression of some kind, given the distinct difference from prior behaviour, but please correct me if I'm wrong :) I also couldn't find any clear mention of this change in the change log for 8.5.15.

Prior to 8.5.15 (specifically, this commit: https://github.com/apache/tomcat/commit/b2bab804b543bfe181fe435efe35628ce0e21b39) the behaviour of `org.apache.catalina.connector.Response` when setting the content-type with encoding parameter included, e.g. `setContentType("application/json;charset=MS932")`, was to simply take the provided encoding string and set this for the output.

As long as the character set was supported by the JVM (as a specific code page, or an alias of one of the supported code pages), requests would return with the *exact* character set string provided.

Since the above commit (released in 8.5.15), the provided value is forcibly rewritten, with no option to disable this behaviour. For instance, if I specify "MS932" or "windows-932" it is now replaced with "windows-31j", "eucjis" becomes "EUC-JP", "sjis" becomes "Shift_JIS", etc.
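The rewriting described above matches what the JVM does when an alias is resolved to a `Charset` and its canonical name is read back. The following is a minimal sketch of that lookup, not Tomcat's code; the class `CharsetAliasDemo` and the helper `canonicalName` are hypothetical names used only for illustration, and the exact output for the Japanese aliases depends on which charsets the installed JDK ships.

```java
import java.nio.charset.Charset;

class CharsetAliasDemo {

    // Hypothetical helper: returns the JVM's canonical name for a charset
    // label, or null if the label is illegal or not supported by this JVM.
    static String canonicalName(String label) {
        try {
            return Charset.isSupported(label) ? Charset.forName(label).name() : null;
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException (malformed label) and null input.
            return null;
        }
    }

    public static void main(String[] args) {
        // Labels taken from the report, plus one alias every JVM supports.
        String[] labels = {"MS932", "windows-932", "eucjis", "sjis", "utf8"};
        for (String label : labels) {
            // The original label is lost once only the Charset is kept.
            System.out.println(label + " -> " + canonicalName(label));
        }
    }
}
```

Any code path that stores only the resolved `Charset` and later writes `charset.name()` into the header will emit the canonical name rather than the alias the application supplied.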

This may seem like reasonable behaviour for modern systems that we would *hope* support mapping aliased encodings, but with legacy systems unable to handle this (and any system that, stupidly or otherwise, checks for a specific encoding string, possibly in a case-sensitive manner), suddenly we have broken behaviour. The client expects one encoding string and receives an equivalent one that it just can't handle.

Unfortunately I'm now stuck in this situation as a legacy-systems integrations engineer. We *have* to be able to provide our output with very specific encoding strings set or else several dozen systems we (sadly) can't change will break. Thankfully we caught this in internal testing of the upgrade to 8.5.15 and can put it off temporarily, but we're now stuck with either maintaining our own patched version of Tomcat to revert this behaviour, no longer updating (not a real option given security requirements), or possibly migrating to an alternative servlet container (please no q_q).
Comment 1 Mark Thomas 2017-06-19 16:50:09 UTC
The change relates to this entry in the change log:

Start to switch to using Charset rather than String to store encoding configuration settings to reduce the number of places the associated Charset needs to be looked up. (markt)

The primary drivers for the change were performance (the repeated String -> Charset calls were relatively expensive) and earlier error reporting when an invalid value was provided.

There might be an alternative way of setting the charset that avoids this restriction. I'll take a look. If that doesn't work, preserving the user-provided value is another option.
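One way to picture the "preserve the user-provided value" option is to resolve the `Charset` once, which keeps the performance and early-error benefits, while storing the original label for output. This is only a sketch under that assumption and not the fix that was committed; `EncodingSetting`, `of`, and `headerValue` are hypothetical names.

```java
import java.nio.charset.Charset;

final class EncodingSetting {

    private final String label;    // exactly what the application passed in
    private final Charset charset; // resolved once, reused for encoding

    private EncodingSetting(String label, Charset charset) {
        this.label = label;
        this.charset = charset;
    }

    // Resolves the label eagerly, so an invalid or unsupported value fails
    // here (Charset.forName throws) instead of at first write.
    static EncodingSetting of(String label) {
        return new EncodingSetting(label, Charset.forName(label));
    }

    Charset charset() {
        return charset;
    }

    // Builds the header value from the original label, not charset.name().
    String headerValue(String mediaType) {
        return mediaType + ";charset=" + label;
    }
}
```

With this shape, `EncodingSetting.of("MS932").headerValue("application/json")` would keep `MS932` in the header while the bytes are still encoded through the resolved `Charset`.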
Comment 2 Mark Thomas 2017-06-19 18:41:23 UTC
Fixed in:
- trunk for 9.0.0.M22 onwards
- 8.5.x for 8.5.16 onwards