Bug 29900

Summary: request params in utf-8 corrupted
Product: Tomcat 5
Reporter: Asher Tarnopolski <ashert>
Component: Unknown
Assignee: Tomcat Developers Mailing List <dev>
Status: RESOLVED INVALID    
Severity: blocker    
Priority: P3    
Version: 5.0.25   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   

Description Asher Tarnopolski 2004-07-03 11:19:44 UTC
a parameter sent in a request in utf-8 encoding arrives as if it had been sent in
another encoding (iso-xxx, windows-xxx or whatever). this works fine with tomcat 4.0
but doesn't work on tomcat 5.0.xx

a jsp code example:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
 
<form action="/tests/utf.jsp" method=post>
<input type=text name=source >
<input type=submit>
</form>
<p>
 
<%
request.setCharacterEncoding("UTF-8");

if(request.getParameter("source")!=null)
{ 
  out.println(request.getParameter("source").length()+"<p>");
 
  out.println(request.getParameter("source"));
 
  StringBuffer sb = new StringBuffer();
  for(int i=0; i<request.getParameter("source").length(); i++)
  {
    // escape '&' so a character reference in the value is displayed literally
    if(request.getParameter("source").charAt(i) == '&')
      sb.append("&amp;");
    else
      sb.append(request.getParameter("source").charAt(i));
 
  }
  out.println("<p>"+ sb.toString());
}
%>
 
</body>
</html>

as you see, this code block gets a utf-8 encoded parameter from
a request and outputs its length, the parameter itself, and the
parameter with '&' escaped so that its html character code is visible.
to test it i send the hebrew letter ALEF. on tomcat 4.xx everything
works perfectly and i get the following response:

7
א
&amp;#1488;

(in case you don't see it here, it's 7, alef as utf-8 code, and alef's utf-8
code escaped to be visible in the browser)

with tomcat 5.0.xx i get:

1
?
?
Comment 1 Mark Thomas 2004-07-03 15:28:01 UTC
TC5 no longer defaults to using the body encoding for parameters as this is 
not spec compliant. See my standard text on encoding (attached below) for more 
info.


REQUESTS
========

There are a number of situations where there may be a requirement to use
non-US ASCII characters in a URI. These include:
- Parameters in the query string
- Servlet paths

There is a standard for encoding URIs
(http://www.w3.org/International/O-URL-code.html) but this standard is not
consistently followed by clients. This causes a number of problems.

The functionality provided by Tomcat (4 and 5) to handle this less than ideal 
situation is described below.

1. The Coyote HTTP/1.1 connector has a useBodyEncodingForURI attribute which 
if set to true will use the request body encoding to decode the URI query 
parameters.
  - The default value is true for TC4 (breaks spec but gives consistent 
behaviour across TC4 versions)
  - The default value is false for TC5 (spec compliant but there may be 
migration issues for some apps)
2. The Coyote HTTP/1.1 connector has a URIEncoding attribute which defaults to 
ISO-8859-1.
3. The parameters class (o.a.t.u.http.Parameters) has a QueryStringEncoding 
field which defaults to the URIEncoding. It must be set before the parameters 
are parsed to have an effect.
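
To illustrate why the decoding charset in items 1-3 matters, here is a minimal
Java sketch. It is not Tomcat code. It decodes the same percent-encoded bytes
once as ISO-8859-1 (the URIEncoding default noted above) and once as UTF-8.
The class name QueryDecodingDemo and the helper decode are made up for this
illustration, and it assumes the client sent the Hebrew letter ALEF as UTF-8
percent-encoded bytes (%D7%90).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryDecodingDemo {

    // Hypothetical helper: percent-decodes a value using the named charset.
    // Wraps the checked exception so callers do not have to declare it.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Assumption: UTF-8 bytes of Hebrew ALEF (U+05D0), percent-encoded.
        String encoded = "%D7%90";

        String asLatin1 = decode(encoded, "ISO-8859-1");
        String asUtf8 = decode(encoded, "UTF-8");

        // Each byte becomes its own character under ISO-8859-1.
        System.out.println(asLatin1.length());
        // The two bytes form a single character under UTF-8.
        System.out.println(asUtf8.length());
        // Decimal code point of the decoded character.
        System.out.println((int) asUtf8.charAt(0));
    }
}
```

Run as written, this prints 2, 1 and 1488: the same two bytes become two
unrelated characters under ISO-8859-1 but a single ALEF under UTF-8, which is
why the charset has to be chosen before the parameters are parsed.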

Things to note regarding the servlet API:
1. HttpServletRequest.setCharacterEncoding() normally only applies to the 
request body NOT the URI.
2. HttpServletRequest.getPathInfo() is decoded by the web container.
3. HttpServletRequest.getRequestURI() is not decoded by container.

Other tips:
1. Use POST with forms to return parameters as the parameters are then part of 
the request body.


RESPONSES
=========

HTML META tags are ignored by Tomcat. You may use
<%@ page pageEncoding="..." %> for JSPs.
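
One possible reading of the "?" output in this report is that the response was
written in a charset that cannot represent the character. The following is a
minimal Java sketch of that effect, independent of Tomcat. It assumes the
response falls back to ISO-8859-1 when the JSP declares no page encoding; the
class name ResponseEncodingDemo and the helper roundTrip are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResponseEncodingDemo {

    // Hypothetical helper: encodes text to bytes with the given charset,
    // as a response writer would, then decodes the bytes with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String alef = "\u05D0"; // Hebrew ALEF, one char in a Java String

        // ISO-8859-1 has no mapping for ALEF, so getBytes substitutes '?'.
        System.out.println(roundTrip(alef, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent ALEF, so the round trip is lossless.
        System.out.println(roundTrip(alef, StandardCharsets.UTF_8).equals(alef));
    }
}
```

This prints "?" and then "true". Note that the string still has length 1 in
both cases, so a correctly decoded parameter can still be displayed as "?" if
only the output charset is wrong.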
Comment 2 Asher Tarnopolski 2004-07-03 20:25:28 UTC
thanks for the reply, but...
i edited server.xml, so that the coyote settings are now these:
<!-- Define a non-SSL Coyote HTTP/1.1 Connector on port 8080 -->
             <Connector port="8080"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               debug="0" connectionTimeout="20000"
               useBodyEncodingForURI="true"  URIEncoding="UTF-8"
               disableUploadTimeout="true" />

i don't see any changes, the problem still exists. by default i send the request by
POST. i refactored the code to work with GET to see if URIEncoding makes any
difference. it doesn't. i'd appreciate your advice.
Comment 3 Mark Thomas 2004-07-04 15:00:49 UTC
This works for me if I add <%@ page pageEncoding="UTF-8" %> to the JSP

This is not a tomcat bug. Please follow up on tomcat-user if you have further 
questions.