I have RewriteValve configured for the ROOT context which is the only context in my deployment setup. I have many RewriteRules like RewriteRule ^/abc/(.*)$ /xyz.do?param=$1 [L] where a part of the URL gets rewritten as querystring when I access the URL with non-ASCII characters like http://www.example.com/abc/在线测试 The page does not load. Checking the rewrite and access logs I found Rewrite Log Rewrote /abc/在线测试 as /xyz.do?param=在线测试 with rule pattern ^/abc/(.*)$ Access log /xyz.do?param=???? This issue only happens when a part of the URL gets rewritten as the querystring. Checking the source code of the RewriteValve chunk.append(URLEncoder.DEFAULT.encode(urlString)); I found the querystring is not encoded. Is this the cause of the problem?
Using java 1.7.0_76 RewriteValve is configured at the context level.
I've added a test case and as far as I can tell, Tomcat is doing the right thing here. I need need to be very careful to ensure that everything was configured to use UTF-8. Until I did that, I saw all sorts of test failures.
I configured UTF-8 everywhere as mentioned in https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q1 1. I set URIEncoding="UTF-8" and useBodyEncodingForURI="true" for the connectors in server.xml 2. configured org.apache.catalina.filters.SetCharacterEncodingFilter at the top in web.xml 3. Explicitly set strict servlet compliance to false using setenv.sh 4. Ensured that no other valves like Request Dumper Valve in http://www.mail-archive.com/users@tomcat.apache.org/msg21117.html was configured. Still the rewritten querystring was getting mangled. Also when there is a redirect like RewriteRule ^/abc/(.*)$ /xyz/$1 [R] In access log "GET /xyz/????.xml HTTP/1.1" the non-ASCII characters in the request URI is also getting mangled like this
I tested by encoding the querystring in the RewriteValve if (queryString != null) { request.getCoyoteRequest().queryString().setString(null); chunk = request.getCoyoteRequest().queryString().getCharChunk(); chunk.recycle(); //chunk.append(queryString); chunk.append(URLEncoder.DEFAULT.encode(queryString)); request.getCoyoteRequest().queryString().toChars(); }
I tested by encoding the querystring in the RewriteValve if (queryString != null) { request.getCoyoteRequest().queryString().setString(null); chunk = request.getCoyoteRequest().queryString().getCharChunk(); chunk.recycle(); //chunk.append(queryString); // encode queryString chunk.append(URLEncoder.DEFAULT.encode(queryString)); request.getCoyoteRequest().queryString().toChars(); } I found that = and & in the querystring also was getting encoded I had to decode the encoded = and & to get it to work as expected. String encQueryString = URLEncoder.DEFAULT.encode(queryString); encQueryString = encQueryString.replaceAll("%3D","="); encQueryString = encQueryString.replaceAll("%26","&"); chunk.append(encQueryString); After making these changes the URL rewriting was working fine Access log /xyz.do?param=%E5%9C%A8%E7%BA%BF%E6%B5%8B%E8%AF%95 I wrote a test jsp and request.getParameter was working fine and retrieved the non-ASCII data with both GET and POST methods Because the rewritten and done flags are set in case of redirect, the issue still remain when there is a redirect.
Re-opening due to additional information being provided by the reporter. Preethi, would it be possible for you to write a test case for this? Configuration + sample output, expected output versus actual output?
Unless a test case that demonstrates this issue is forthcoming, this will get resolved as WORKSFORME again.
Created attachment 34176 [details] Test case for non-ascii characters in querystring and with redirect flag Added testcases using TestRewriteValve.java for non-ascii querystring and non-ascii in request uri(for redirect case).
Created attachment 34177 [details] Non-Ascii QueryString failure trace
Created attachment 34178 [details] Non-Ascii Redirect failure trace
Added testcase for the mentioned cases
Thanks for the test cases. They really do make bug reports a lot easier to work with. I'll start looking at them today.
Progress is being made. The test cases uncovered some inconsistencies in handling of encoded urlPatterns obtained from web.xml. These have been fixed in trunk (r1758423). I have added the test cases (some clean-up and a few fixes such as setting the response body to UTF-8) and fixed the remaining issues with the rewrite valve in trunk (r1758425). There are a few more related edge cases I want to test. Once that is complete I'll look at back-porting these changes to 8.5.x and 8.0.x.
This has been fixed in the following branches: - 9.0.x for 9.0.0.M10 onwards - 8.5.x for 8.5.5 onwards - 8.0.x for 8.0.37 onwards
I am testing the RewriteValve in the trunk after the issue fix. QueryStrings in the case of redirect are still not encoded.
Created attachment 34205 [details] Testcase for Redirect with QueryStrings Both cases testNonAsciiQueryString and testNonAsciiQueryStringAndRedirectPath fail when a redirect flag is added.
Thanks for the additional tests. There isn't going to be a perfect fix without some potentially invasive changes to the RedirectValve. The problem is that encoding is not always the inverse of decoding. The Valve would have to be re-factored to work with undecoded URIs. An imperfect fix that ignored some edge cases should be fairly simple.
Created attachment 34208 [details] testNonAsciiQueryString failure trace
Created attachment 34209 [details] testNonAsciiQueryStringAndRedirectPath failure trace
I've done some further digging and adding support for the 'B' flag looks like a better way to solve this since that is more closely aligned with httpd. Need to see if that is possible.
After further investigation, httpd won't handle some of these cases either. Given that the intention is to mimic, as far as possible, httpd functionality then I am currently leaning towards a solution that does that. That does mean some of the current tests cases will need adjustment else they will start failing. I will take another look at working with the undecoded URI but that presents its own set of problems.
I've reworked this in trunk/9.0.x. Please test and provide feedback. Assuming there are no issues, I'll back-port this for the next set of releases.
Thanks for the update. Testing the changes in trunk I found some failure cases. Attached the testcase
Created attachment 34220 [details] B flag failure cases
I don't think using B flag for URLs with non-ASCII characters is the right approach. I am using httpd mod_rewrite for URL rewrite now and so far there have been no issues in non-ASCII characters getting rewritten as querystring. Supporting B flag will not solve the issue in the URLs in the attached testcases as &,= will also get encoded.
Let me do some more investigation.
I think I have a better understanding of what is going on. Both Tomcat and httpd decode and normalize the URI before rewriting. The problem occurs later. Consider the following rule: RewriteRule /a/(.*) /b?$1 With httpd the query string is as follows: /a/id=%61 -> id=a /a/id=%3d -> id== /a/id=%3f -> id=? /b?id=%61 -> id=%61 /b?id=%3d -> id=%3d /b?id=%3f -> id=%3f Note how the query string varies depending on whether the request goes through the rewrite rule or not. The problem isn't limited to the query string but it does illustrates the problem. Note also that an application isn't going to be able to unambiguously parse some of those re-written query strings and that using the [B] flag won't help in this since it will incorrectly encode the first '='. In Tomcat the problem is more immediately apparent because the query string has to be presented in encoded form. I think we can get closer to the httpd behaviour but there are always going to be edge cases we won't be able to fix.
I think the rewritten URL should be encoded(similar to mod_rewrite approach). In the fix provided in 8.0.37 release, the request-uri and querystring are encoded after rewrite. Handling redirect in a similar fashion will solve this problem. Please update any progress in the RewriteValve.
Those most recent tests aren't consistent with httpd's behaviour. Both Tomcat and httpd process rewrites in very similar ways (to a point). The original query string is retain in encoded form. The path URI is processed in decoded and normalized form. Any query string generated by the rewrite is in decoded form (note this differs from the original). Before any rewrite, any decoded elements are encoded. The NE flag disables the encoding of decoded elements on rewrite. The B flag encodes back references. It is therefore possible for unwanted double encoding to occur when B and R are combined. It is also possible for UTF-8 bytes to appear directly in the location header when R is combined with NE. I'm fairly confident of the latest code but I'll leave it a little while before back-porting to give you a chance to test it and provide feedback.
Thanks for the update. The concerns when R flag is used with NE and B flag are valid. At a glance, the current fix seems to be fine. I will test the latest changes and get back to you.
There seems to be a double encoding issue in QSA append. I added a test case with L and QSA flag which failed. Looks like the original query string as well as the part of the request URI which got rewritten as querystring are getting encoded. This needs to be handled in the isQsappend code block.
Created attachment 34226 [details] QueryString Append Failure
It appears that httpd (at least as provided by 2.4.7 in Ubuntu 14.04.5 which I am using to test) sometimes double encodes when QSA is used as well. I've reworked the QSA handling, added your test case, fixed another test case that was broken by this fix and made sure your name is in the changelog. Thanks for sticking with this. Let us know how you get on. I'd like to get this all sported out in time for the next release scheduled for the end of the month.
Thanks for the update. I am testing the latest changes. So far I found a failure when there is a rule with QSA but no querystring is supplied. It is probably due to the absence of null check in QSA handling code. I will let you know after testing more thoroughly.
Created attachment 34247 [details] failure case for QSA
Any update in QSA issue?
Created attachment 34281 [details] fix for failure case (QSA) Mark, this is a small fix for the test "failure case for QSA (attachment 34247 [details])" provided by Santhana Preethi (comment 35)
Comment on attachment 34281 [details] fix for failure case (QSA) Thanks variation of the patch applied.
Thanks for the fix.
This has been fixed in the following branches: - 9.0.x for 9.0.0.M11 onwards - 8.5.x for 8.5.6 onwards - 8.0.x for 8.0.38 onwards