Bug 49464

Summary: DefaultServlet and CharacterEncoding
Product: Tomcat 6
Reporter: Felix Schumacher <felix.schumacher>
Component: Catalina
Assignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED    
Severity: enhancement    
Priority: P2    
Version: unspecified   
Target Milestone: default   
Hardware: All   
OS: All   

Description Felix Schumacher 2010-06-18 08:13:17 UTC
DefaultServlet doesn't set a character encoding. As per the spec, the encoding of the page is then assumed to be ISO-8859-1.

If files are served with a different encoding, this can lead to display problems in the browser.
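
To illustrate the display problem, here is a minimal sketch using only the JDK. The class and method names are made up for this example and are not part of Tomcat. It shows what a client sees when UTF-8 bytes are decoded as ISO-8859-1 because no charset was sent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Tomcat code.
final class CharsetMismatchDemo {

    // Decode bytes the way a client would if it assumed ISO-8859-1.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // In UTF-8, U+00F6 (o with diaeresis) is stored as two bytes: 0xC3 0xB6.
        byte[] utf8Bytes = "Sch\u00f6nhaber".getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 maps every byte to exactly one character, so the single
        // character turns into two: U+00C3 followed by U+00B6.
        String garbled = decodeAsLatin1(utf8Bytes);
        System.out.println(garbled.length()); // one character longer than the original
    }
}
```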

The problem has been discussed in http://old.nabble.com/DefaultServlet-doesn%27t-set-charset-td18893115.html#a18929527
In http://marc.info/?l=tomcat-user&m=127678462332564&w=2 another component was added - namely mod_jk/httpd - which set a character encoding on its own, if no character set was set previously.

There are at least three different solutions for this problem. 

One of them is extending DefaultServlet to be configurable to include a charset in the response. A patch has been proposed by Markus Schönhaber (MKS) and can be found at
  http://www.ddt-consult.de/sendCharset.patch

The two other solutions which were discussed are
 * configure httpd/mod_jk properly by adding AddDefaultCharset ENCODING to the right location/host
 * Use a filter to set the character encoding

All in all I still think it would be a good idea to explicitly set the wanted encoding in the first possible place, which is the DefaultServlet.
Comment 1 Mark Thomas 2017-06-30 16:27:46 UTC
I've been digging into this and I think the situation is a little more complicated.

There are three scenarios to consider:
a) directly returning a file
b) including a file into an output stream
c) including a file into a writer

a) is the simple case. We can set the character encoding to be the effective value of fileEncoding (i.e. the configured value, or the system default if it is not set).
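
A rough sketch of what "effective value of fileEncoding" could mean, using only the JDK. The helper names are hypothetical and this is not the committed patch. In a servlet the resulting charset name would be passed to something like ServletResponse.setCharacterEncoding(String) rather than concatenated by hand.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tomcat code.
final class EffectiveFileEncoding {

    // fileEncoding stands in for the DefaultServlet init parameter;
    // null or empty is treated as "not set" (an assumption of this sketch).
    static Charset effectiveCharset(String fileEncoding) {
        if (fileEncoding == null || fileEncoding.isEmpty()) {
            return Charset.defaultCharset();
        }
        return Charset.forName(fileEncoding);
    }

    // Build a Content-Type value that carries the charset explicitly.
    static String contentTypeWithCharset(String mimeType, Charset charset) {
        return mimeType + ";charset=" + charset.name();
    }

    public static void main(String[] args) {
        System.out.println(contentTypeWithCharset("text/html", effectiveCharset("UTF-8")));
        System.out.println(contentTypeWithCharset("text/html", effectiveCharset(null)));
    }
}
```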

b) and c) are trickier. In both cases we need to read the input as characters (conversion from bytes via fileEncoding). Then for b) we need to write it out again using whatever output encoding has been set on the response. For c) we can just write the characters and let the writer handle it.
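
The two include cases can be sketched with plain java.io classes. This is an illustration only: the class and method names are invented, and the OutputStream and Writer parameters stand in for what the response would provide.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Tomcat code.
final class IncludeTranscoding {

    // Case c): decode the file with fileEncoding and pass characters to the
    // writer, which applies the output encoding itself.
    static void copyToWriter(InputStream file, Charset fileEncoding, Writer out)
            throws IOException {
        Reader in = new InputStreamReader(file, fileEncoding);
        char[] buffer = new char[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.flush();
    }

    // Case b): decode with fileEncoding, then re-encode with the encoding
    // that has been set for the output.
    static void copyToStream(InputStream file, Charset fileEncoding,
            OutputStream out, Charset outputEncoding) throws IOException {
        Writer writer = new OutputStreamWriter(out, outputEncoding);
        copyToWriter(file, fileEncoding, writer); // flushes the writer
    }

    public static void main(String[] args) throws IOException {
        // 0xF6 is U+00F6 in ISO-8859-1; in UTF-8 it needs two bytes.
        byte[] latin1 = {(byte) 0xF6};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyToStream(new ByteArrayInputStream(latin1),
                StandardCharsets.ISO_8859_1, out, StandardCharsets.UTF_8);
        System.out.println(out.size());
    }
}
```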

I think that covers all the cases although some edge cases may emerge as I dig into this.

As far as I can see this can all be done without any additional configuration options. I'm not so sure it can be done without changing some method signatures. While those methods are protected and internal to Tomcat, the default servlet is something that tends to get 'tweaked' by users so we'll need to tread carefully if we back-port any of this.
Comment 2 Christopher Schultz 2017-06-30 21:13:36 UTC
(In reply to Mark Thomas from comment #1)
> I've been digging into this and I think the situation is a little more
> complicated.
> 
> There are three scenarios to consider:
> a) directly returning a file
> b) including a file into an output stream
> c) including a file into a writer
> 
> a) is the simple case. We can set the character encoding to be the effective
> value of fileEncoding (i.e. the value or system default if not set)

What if web.xml contains a <mime-type> which includes a charset parameter? I think respecting that parameter would be good if possible.
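
For illustration, extracting a charset parameter from a content type string could look roughly like this. The helper is hypothetical and uses only the JDK. It is a simplified parser that does not cover every quoting rule for media type parameters.

```java
import java.util.Locale;

// Hypothetical helper, not Tomcat code.
final class MimeTypeCharset {

    // Returns the value of the charset parameter, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip one pair of surrounding double quotes, if present.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```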

> b) and c) are trickier. In both cases we need to read the input as
> characters (conversion from bytes via fileEncoding). Then for b) we need to
> write it out again using whatever output encoding has been set on the
> response. c) we can just write the characters and let the writer handle it.

I'm assuming that binary file types are basically out-of-scope here, right?

> I think that covers all the cases although some edge cases may emerge as I
> dig into this.
> 
> As far as I can see this can all be done without any additional
> configuration options. I'm not so sure it can be done without changing some
> method signatures. While those methods are protected and internal to Tomcat,
> the default servlet is something that tends to get 'tweaked' by users so
> we'll need to tread carefully if we back-port any of this.

+1
Comment 3 Mark Thomas 2017-06-30 21:40:08 UTC
If the response character encoding is set (via any of the available means to do so) then the patch will respect that.

Correct, binary files are out of scope. I'll double check the patch doesn't impact them.

Fixed in:
- trunk for 9.0.0.M23 onwards
- 8.5.x for 8.5.17 onwards
- 8.0.x for 8.0.46 onwards
- 7.0.x for 7.0.80 onwards
Comment 4 Mark Thomas 2017-06-30 21:48:49 UTC
Whoops. Binary files are caught in this. That needs fixing. Thanks for the hint.
Comment 5 Mark Thomas 2017-06-30 22:03:03 UTC
And fixed. Same versions as above.
Comment 6 Mark Thomas 2017-07-31 10:22:57 UTC
Re-opening. The first attempt at fixing this triggered a series of regressions. The fix has therefore been reverted in 7.0.x, 8.0.x and 8.5.x.

This needs more careful consideration. The end result may be that it is only fixed for 9.0.x.
Comment 7 Mark Thomas 2017-07-31 19:49:02 UTC
With a significant increase in the number of unit tests and a number of additional regressions fixed, this is now fixed again for 9.0.x.

Given the history of regressions, I do not propose back-porting this to earlier versions at this time.