Bug 48550

Summary: Update examples and default server.xml to use UTF-8
Product: Tomcat 7
Reporter: Konstantin Kolinko <knst.kolinko>
Component: Examples
Assignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED FIXED
Severity: enhancement
CC: kiralyattila.hu, michael.sonnleitner
Priority: P2
Version: trunk
Target Milestone: ---
Hardware: PC
OS: All

Description Konstantin Kolinko 2010-01-14 13:52:03 UTC
It is just an idea, but I think that with Tomcat 7 we can update our server.xml and our examples to use UTF-8.

That is:

1) add URIEncoding="UTF-8" to HTTP and AJP <Connector> elements in the default server.xml

2) configure SetCharacterEncodingFilter in the examples webapp

3) update Servlet and JSP examples to allow UTF-8 input (1) and 2) will provide that) and to use UTF-8 as their output character encoding

4) the servlet/JSP sources will probably stay as ISO-8859-1, as they are now
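
To illustrate why the decoding charset in item 1) matters, here is a minimal sketch using only the JDK. It is not Tomcat code: the class and method names are made up for this example, and `URLDecoder` merely stands in for whatever the connector does internally when it decodes a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Tomcat.
class QueryDecodingDemo {

    // Decodes a percent-encoded query value with the named charset.
    // This only approximates the effect of a connector-level URI encoding setting.
    public static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9, percent-encoded from its UTF-8 bytes (C3 A9).
        String encoded = "caf%C3%A9";

        // Decoding with UTF-8 recovers the original four characters.
        String asUtf8 = decode(encoded, "UTF-8");
        System.out.println(asUtf8.length()); // prints 4

        // Decoding the same bytes as ISO-8859-1 yields two characters
        // (U+00C3, U+00A9) in place of one, i.e. five characters in total.
        String asLatin1 = decode(encoded, "ISO-8859-1");
        System.out.println(asLatin1.length()); // prints 5
    }
}
```

If the client sends UTF-8 and the server decodes with ISO-8859-1, the parameter arrives garbled rather than rejected, which is why the mismatch is easy to miss.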

Please add, if I missed anything.


For reference:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

I think we are a bit busy right now, so I am filing this issue, supposing that a more detailed discussion can be raised later on dev@ or users@.
Comment 1 unnivm 2010-11-26 07:00:25 UTC
Please clarify what is the actual difference between these two statements:

3) update Servlet and JSP examples to allow UTF-8 input (1) and 2) will provide
that) and to use UTF-8 as their output character encoding

4) the servlet/JSP sources will probably stay as ISO-8859-1, as they are now
Comment 2 Christopher Schultz 2010-12-03 11:31:00 UTC
(In reply to comment #1)
> Please clarify what is the actual difference between these two statements:
> 
> 3) update Servlet and JSP examples to allow UTF-8 input (1) and 2) will provide
> that) and to use UTF-8 as their output character encoding
> 
> 4) the servlet/JSP sources will probably stay as ISO-8859-1, as they are now

#3 means changing the examples webapp to accept UTF-8 input (shouldn't be a big deal, as #1 and #2 provide that, as mentioned) and to set <%@ page pageEncoding="UTF-8" %> in order to set the output encoding.

#4 means that we won't bother re-encoding all of the JSP files as UTF-8 because a) such a change would be surprising to users and b) it is not necessary, as those pages are probably all pure ASCII at this point anyway.
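
The point in b) can be checked with a small JDK-only sketch: pure ASCII text has the same bytes in ISO-8859-1 and UTF-8, so re-encoding such a file changes nothing. The class name and sample strings below are illustrative assumptions and are not taken from the examples webapp.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Tomcat.
class AsciiEncodingDemo {

    // Returns true when the text encodes to identical bytes in ISO-8859-1 and UTF-8.
    public static boolean sameBytesInBothEncodings(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // Pure ASCII markup: identical bytes in both encodings.
        System.out.println(sameBytesInBothEncodings("<h1>Hello World</h1>")); // prints true

        // U+00E9 is one byte in ISO-8859-1 but two bytes in UTF-8.
        System.out.println(sameBytesInBothEncodings("caf\u00e9")); // prints false
    }
}
```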
Comment 3 Attila Király 2010-12-17 13:23:50 UTC
As a user of Tomcat and a webapp developer, I would really like to see 1) added to the default server.xml. I mostly develop apps using UTF-8 encoding, and if the customer is using Tomcat, extra care is needed either to avoid non-ISO-8859-1 characters in query parameters or to convince them to modify the Tomcat configuration (of these options, the former is always the easier).

Some test results:
- Glassfish 3.0.1 documentation contains a similar, optional (default value "UTF-8") attribute called "uri-encoding" on the "http" element in the "domain.xml" (mentioned here: http://docs.sun.com/app/docs/doc/821-1753/girlq?l=en&a=view#indexterm-246 ). Unfortunately it does not have any effect on query encoding (tried it with different values but always ISO-8859-1 was used to decode query parameters). This might be a bug in GF but the intention is there.
- On client side FF 3.6, Chrome 8, Opera 11 and IE9 Beta (and as I found on the web older versions too) use the character encoding of the page to encode the query parameters. So if the html is served with utf-8 encoding the query parameters are encoded with utf-8.
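
As a rough illustration of the client-side behaviour described above, the sketch below shows how the same parameter value is percent-encoded differently depending on the charset used. It uses the JDK's `URLEncoder` as a stand-in for the browser; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; URLEncoder only approximates what a browser does.
class QueryEncodingDemo {

    // Percent-encodes a query parameter value using the named charset.
    public static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "caf" followed by U+00E9

        // A page served as UTF-8 would lead to the two-byte form.
        System.out.println(encode(value, "UTF-8")); // prints caf%C3%A9

        // A page served as ISO-8859-1 would lead to the single-byte form.
        System.out.println(encode(value, "ISO-8859-1")); // prints caf%E9
    }
}
```

The server cannot tell from the bytes alone which form it received, so it has to be configured with (or told) the charset to decode with.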
Comment 4 Attila Király 2010-12-17 13:33:51 UTC
One more piece of information: further testing revealed that Glassfish 3.0.1 actually uses the request encoding for query parameter decoding. Calling request.setCharacterEncoding("UTF-8"); triggered UTF-8 based decoding for parameters.
Comment 5 Christopher Schultz 2010-12-17 14:35:27 UTC
(In reply to comment #3)
> - On client side FF 3.6, Chrome 8, Opera 11 and IE9 Beta (and as I found on the
> web older versions too) use the character encoding of the page to encode the
> query parameters. So if the html is served with utf-8 encoding the query
> parameters are encoded with utf-8.

Could you provide references to the above? I had trouble finding official default values for the URL character encoding used by browsers.

There's also the trouble of users being able to override the default and revert back to (most likely) ISO-8859-1 encoding.

Right now, I'm -1 for making URIEncoding="UTF-8" by default since it might break a lot of servers, but I'm willing to be convinced. For the record, I always set URIEncoding="UTF-8" on my projects but we don't want an out-of-the-box server configuration to surprise anyone.
Comment 6 Attila Király 2010-12-18 05:54:48 UTC
(In reply to comment #5)
> (In reply to comment #3)
> > - On client side FF 3.6, Chrome 8, Opera 11 and IE9 Beta (and as I found on the
> > web older versions too) use the character encoding of the page to encode the
> > query parameters. So if the html is served with utf-8 encoding the query
> > parameters are encoded with utf-8.
> 
> Could you provide references to the above? I had trouble finding official
> default values for the URL character encoding used by browsers.

I am afraid I cannot give official references. I tested the exact browser versions mentioned above (with UTF-8 and ISO-8859-1 encoded pages and links), and they behave as I described. It is also mentioned in:
- Tomcat wiki: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q9
"Many browsers are starting to offer (default) options of encoding URIs using UTF-8 instead of ISO-8859-1. Some browsers appear to use the encoding of the current page to encode URIs for links (see the note above regarding browser behavior for POST encoding)."
- MozillaZine KB about the Firefox "network.standard-url.encode-query-utf8" config property: 
http://kb.mozillazine.org/Network.standard-url.encode-query-utf8
"For compatibility with these websites, as well as parity with IE and Opera, Mozilla now treats the query portion of a URI (the part following the ?) differently than the rest.[...]
Encode the query portion of IRIs using the same encoding as the current page. (Default)"

Additionally Jetty is also using UTF-8 by default:
Jetty wiki: http://docs.codehaus.org/display/JETTY/International+Characters+and+Character+Encodings#InternationalCharactersandCharacterEncodings-InternationalcharactersinURLs
"The W3C organization's HTML standard now recommends the use of UTF-8: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars and accordingly jetty-6 series uses a default of UTF-8."

> 
> There's also the trouble of users being able to override the default and revert
> back to (most likely) ISO-8859-1 encoding.
> 
> Right now, I'm -1 for making URIEncoding="UTF-8" by default since it might
> break a lot of servers, but I'm willing to be convinced. For the record, I
> always set URIEncoding="UTF-8" on my projects but we don't want an
> out-of-the-box server configuration to surprise anyone.

This is true. However, it seems to me that the web is moving in a UTF-8 based direction, so I think the default encoding in Tomcat should be changed at some point. That is a backward compatibility issue, so it should be made in a major release. 7.0 could be that release. If it is not done now, the next opportunity is 8.0. I am not saying developers cannot live without this change; I can cope with it as I always have (I only mentioned my reasons here because this issue was already open).

Probably my real problem is that query parameter decoding is inconsistent between servlet containers and there is no way to control it on a per-webapp basis (instead of as a server-wide option) in Tomcat (one could use the "useBodyEncodingForURI=true" attribute, but that is still a modification to server.xml).

I would also be happy with a Jetty-like solution. In Jetty 7.2, UTF-8 is the default for query decoding, but it can be overridden on a per-request basis with request.setAttribute("org.eclipse.jetty.server.Request.queryEncoding", "ISO-8859-1");. Tomcat could have something like that. So in a filter I could call:
request.setCharacterEncoding("UTF-8"); // for Glassfish 3 query decoding; already done anyway, as it is needed for POSTs in all servlet containers
request.setAttribute("org.eclipse.jetty.server.Request.queryEncoding", "UTF-8"); // for Jetty, just to be sure
request.setAttribute("org.apache.tomcat.Request.queryEncoding or similar", "UTF-8"); // for Tomcat 7 and up
and get a safe, portable approach for at least three servlet containers.
Comment 7 Peter Flynn 2010-12-22 05:16:10 UTC
I had some problems forcing Cocoon output to be UTF-8 (using Tomcat5 and Cocoon 2.1.11) because I didn't realise the default was ISO-8859-1 (everything else in my sitemaps and XSLT was set to UTF-8, which is what puzzled me).

Our internal controls insist on UTF-8 for everything, so this was only exposed when we accessed external resources (which could of course be anything, including Windows-1252).

My gut feeling is that if we are to continue the general move towards end-to-end XML in the business process (or at least, XML-as-early-as-possible), then making the character repertoire uniform is A Good Idea, so a default of UTF-8 would seem very sensible.
Comment 8 Konstantin Preißer 2013-08-01 22:07:10 UTC
Hi,

as this has not been applied to Tomcat 7, what about Tomcat 8?
Comment 9 Mark Thomas 2013-08-07 19:21:45 UTC
Part 1 of the 4 tasks in the description has been completed for trunk (a.k.a. 8.0.x).
Comment 10 Mark Thomas 2013-08-07 20:10:14 UTC
This has been fixed for trunk (a.k.a. 8.0.x) and will be included in 8.0.0-RC2 onwards.