Bug 60013 - Non-ASCII characters in querystring get mangled after URL Rewrite using RewriteValve
Summary: Non-ASCII characters in querystring get mangled after URL Rewrite using Rewri...
Alias: None
Product: Tomcat 8
Classification: Unclassified
Component: Catalina (show other bugs)
Version: 8.0.35
Hardware: PC Mac OS X 10.1
: P2 major (vote)
Target Milestone: ----
Assignee: Tomcat Developers Mailing List
Depends on:
Reported: 2016-08-18 08:01 UTC by Santhana Preethi
Modified: 2016-09-21 13:35 UTC (History)
2 users (show)

Test case for non-ascii characters in querystring and with redirect flag (3.71 KB, text/x-csrc)
2016-08-27 11:24 UTC, Santhana Preethi
Non-Ascii QueryString failure trace (356.18 KB, image/png)
2016-08-27 11:25 UTC, Santhana Preethi
Non-Ascii Redirect failure trace (282.12 KB, image/png)
2016-08-27 11:26 UTC, Santhana Preethi
Testcase for Redirect with QueryStrings (3.04 KB, text/plain)
2016-09-06 12:16 UTC, Santhana Preethi
testNonAsciiQueryString failure trace (211.90 KB, image/png)
2016-09-06 13:22 UTC, Santhana Preethi
testNonAsciiQueryStringAndRedirectPath failure trace (194.94 KB, image/png)
2016-09-06 13:23 UTC, Santhana Preethi
B flag failure cases (2.37 KB, text/plain)
2016-09-07 12:19 UTC, Santhana Preethi
QueryString Append Failure (2.16 KB, text/plain)
2016-09-09 18:27 UTC, Santhana Preethi
failure case for QSA (3.73 KB, text/plain)
2016-09-14 11:35 UTC, Santhana Preethi
fix for failure case (QSA) (2.54 KB, patch)
2016-09-20 18:05 UTC, Tiago Oliveira
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Santhana Preethi 2016-08-18 08:01:05 UTC
I have RewriteValve configured for the ROOT context which is the only context in my deployment setup. 

I have many RewriteRules like 

RewriteRule ^/abc/(.*)$ /xyz.do?param=$1 [L]

where a part of the URL gets rewritten as querystring 

when I access the URL with non-ASCII characters like http://www.example.com/abc/在线测试

The page does not load. Checking the rewrite and access logs I found 

Rewrite Log

Rewrote /abc/在线测试 as /xyz.do?param=在线测试 with rule pattern ^/abc/(.*)$

Access log


This issue only happens when a part of the URL gets rewritten as the querystring. Checking the source code of the RewriteValve 


I found the querystring is not encoded. Is this the cause of the problem?
Comment 1 Santhana Preethi 2016-08-22 13:45:10 UTC
Using java 1.7.0_76
RewriteValve is configured at the context level.
Comment 2 Mark Thomas 2016-08-23 13:25:56 UTC
I've added a test case and as far as I can tell, Tomcat is doing the right thing here. I need need to be very careful to ensure that everything was configured to use UTF-8. Until I did that, I saw all sorts of test failures.
Comment 3 Santhana Preethi 2016-08-24 10:29:48 UTC
I configured UTF-8 everywhere as mentioned in https://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q1 

1. I set URIEncoding="UTF-8" and useBodyEncodingForURI="true" for the connectors in server.xml 
2. configured org.apache.catalina.filters.SetCharacterEncodingFilter at the top in web.xml 
3. Explicitly set strict servlet compliance to false using setenv.sh 
4. Ensured that no other valves like Request Dumper Valve in http://www.mail-archive.com/users@tomcat.apache.org/msg21117.html was configured.

Still the rewritten querystring was getting mangled. 
Also when there is a redirect like 

RewriteRule ^/abc/(.*)$ /xyz/$1 [R]

In access log
"GET /xyz/????.xml HTTP/1.1"

the non-ASCII characters in the request URI is also getting mangled like this
Comment 4 Santhana Preethi 2016-08-24 10:33:37 UTC
I tested by encoding the querystring in the RewriteValve

if (queryString != null) 
     chunk = request.getCoyoteRequest().queryString().getCharChunk();

Comment 5 Santhana Preethi 2016-08-24 10:45:30 UTC
I tested by encoding the querystring in the RewriteValve

if (queryString != null) 
     chunk = request.getCoyoteRequest().queryString().getCharChunk();
     // encode queryString 

I found that = and & in the querystring also was getting encoded 
I had to decode the encoded = and & to get it to work as expected.
     String encQueryString = URLEncoder.DEFAULT.encode(queryString);
     encQueryString = encQueryString.replaceAll("%3D","=");
     encQueryString = encQueryString.replaceAll("%26","&");

After making these changes the URL rewriting was working fine 

Access log 


I wrote a test jsp and request.getParameter was working fine and retrieved the non-ASCII data with both GET and POST methods 

Because the rewritten and done flags are set in case of redirect, the issue still remain when there is a redirect.
Comment 6 Christopher Schultz 2016-08-25 20:47:23 UTC
Re-opening due to additional information being provided by the reporter.

Preethi, would it be possible for you to write a test case for this? Configuration + sample output, expected output versus actual output?
Comment 7 Mark Thomas 2016-08-25 20:54:53 UTC
Unless a test case that demonstrates this issue is forthcoming, this will get resolved as WORKSFORME again.
Comment 8 Santhana Preethi 2016-08-27 11:24:09 UTC
Created attachment 34176 [details]
Test case for non-ascii characters in querystring and with redirect flag

Added testcases using TestRewriteValve.java for non-ascii querystring and non-ascii in request uri(for redirect case).
Comment 9 Santhana Preethi 2016-08-27 11:25:29 UTC
Created attachment 34177 [details]
Non-Ascii QueryString failure trace
Comment 10 Santhana Preethi 2016-08-27 11:26:10 UTC
Created attachment 34178 [details]
Non-Ascii Redirect failure trace
Comment 11 Santhana Preethi 2016-08-27 11:26:38 UTC
Added testcase for the mentioned cases
Comment 12 Mark Thomas 2016-08-29 20:41:43 UTC
Thanks for the test cases. They really do make bug reports a lot easier to work with. I'll start looking at them today.
Comment 13 Mark Thomas 2016-08-30 16:47:16 UTC
Progress is being made.

The test cases uncovered some inconsistencies in handling of encoded urlPatterns obtained from web.xml. These have been fixed in trunk (r1758423).

I have added the test cases (some clean-up and a few fixes such as setting the response body to UTF-8) and fixed the remaining issues with the rewrite valve in trunk (r1758425).

There are a few more related edge cases I want to test. Once that is complete I'll look at back-porting these changes to 8.5.x and 8.0.x.
Comment 14 Mark Thomas 2016-08-30 20:12:12 UTC
This has been fixed in the following branches:
- 9.0.x for 9.0.0.M10 onwards
- 8.5.x for 8.5.5 onwards
- 8.0.x for 8.0.37 onwards
Comment 15 Santhana Preethi 2016-09-06 12:12:32 UTC
I am testing the RewriteValve in the trunk after the issue fix. QueryStrings in the case of redirect are still not encoded.
Comment 16 Santhana Preethi 2016-09-06 12:16:14 UTC
Created attachment 34205 [details]
Testcase for Redirect with QueryStrings

Both cases testNonAsciiQueryString and testNonAsciiQueryStringAndRedirectPath fail when a redirect flag is added.
Comment 17 Mark Thomas 2016-09-06 12:35:20 UTC
Thanks for the additional tests.

There isn't going to be a perfect fix without some potentially invasive changes to the RedirectValve. The problem is that encoding is not always the inverse of decoding. The Valve would have to be re-factored to work with undecoded URIs.

An imperfect fix that ignored some edge cases should be fairly simple.
Comment 18 Santhana Preethi 2016-09-06 13:22:05 UTC
Created attachment 34208 [details]
testNonAsciiQueryString failure trace
Comment 19 Santhana Preethi 2016-09-06 13:23:13 UTC
Created attachment 34209 [details]
testNonAsciiQueryStringAndRedirectPath failure trace
Comment 20 Mark Thomas 2016-09-06 13:34:01 UTC
I've done some further digging and adding support for the 'B' flag looks like a better way to solve this since that is more closely aligned with httpd. Need to see if that is possible.
Comment 21 Mark Thomas 2016-09-06 19:10:35 UTC
After further investigation, httpd won't handle some of these cases either. Given that the intention is to mimic, as far as possible, httpd functionality then I am currently leaning towards a solution that does that. That does mean some of the current tests cases will need adjustment else they will start failing.

I will take another look at working with the undecoded URI but that presents its own set of problems.
Comment 22 Mark Thomas 2016-09-06 21:26:57 UTC
I've reworked this in trunk/9.0.x. Please test and provide feedback. Assuming there are no issues, I'll back-port this for the next set of releases.
Comment 23 Santhana Preethi 2016-09-07 12:17:44 UTC
Thanks for the update. Testing the changes in trunk I found some failure cases. Attached the testcase
Comment 24 Santhana Preethi 2016-09-07 12:19:10 UTC
Created attachment 34220 [details]
B flag failure cases
Comment 25 Santhana Preethi 2016-09-07 12:43:51 UTC
I don't think using B flag for URLs with non-ASCII characters is the right approach. I am using httpd mod_rewrite for URL rewrite now and so far there have been no issues in non-ASCII characters getting rewritten as querystring.

Supporting B flag will not solve the issue in the URLs in the attached testcases as &,= will also get encoded.
Comment 26 Mark Thomas 2016-09-07 14:56:45 UTC
Let me do some more investigation.
Comment 27 Mark Thomas 2016-09-07 15:43:50 UTC
I think I have a better understanding of what is going on.

Both Tomcat and httpd decode and normalize the URI before rewriting. The problem occurs later.

Consider the following rule:

RewriteRule /a/(.*) /b?$1

With httpd the query string is as follows:

/a/id=%61 -> id=a
/a/id=%3d -> id==
/a/id=%3f -> id=?
/b?id=%61 -> id=%61
/b?id=%3d -> id=%3d
/b?id=%3f -> id=%3f

Note how the query string varies depending on whether the request goes through the rewrite rule or not. The problem isn't limited to the query string but it does illustrates the problem. Note also that an application isn't going to be able to unambiguously parse some of those re-written query strings and that using the [B] flag won't help in this since it will incorrectly encode the first '='.

In Tomcat the problem is more immediately apparent because the query string has to be presented in encoded form.

I think we can get closer to the httpd behaviour but there are always going to be edge cases we won't be able to fix.
Comment 28 Santhana Preethi 2016-09-08 12:41:46 UTC
I think the rewritten URL should be encoded(similar to mod_rewrite approach). In the fix provided in 8.0.37 release, the request-uri and querystring are encoded after rewrite. Handling redirect in a similar fashion will solve this problem. 

Please update any progress in the RewriteValve.
Comment 29 Mark Thomas 2016-09-09 13:51:32 UTC
Those most recent tests aren't consistent with httpd's behaviour.

Both Tomcat and httpd process rewrites in very similar ways (to a point).

The original query string is retain in encoded form.
The path URI is processed in decoded and normalized form.
Any query string generated by the rewrite is in decoded form (note this differs from the original).
Before any rewrite, any decoded elements are encoded.

The NE flag disables the encoding of decoded elements on rewrite.

The B flag encodes back references.

It is therefore possible for unwanted double encoding to occur when B and R are combined.

It is also possible for UTF-8 bytes to appear directly in the location header when R is combined with NE.

I'm fairly confident of the latest code but I'll leave it a little while before back-porting to give you a chance to test it and provide feedback.
Comment 30 Santhana Preethi 2016-09-09 14:04:22 UTC
Thanks for the update. The concerns when R flag is used with NE and B flag are valid. At a glance, the current fix seems to be fine. I will test the latest changes and get back to you.
Comment 31 Santhana Preethi 2016-09-09 18:20:34 UTC
There seems to be a double encoding issue in QSA append. I added a test case with L and QSA flag which failed. 
Looks like the original query string as well as the part of the request URI which got rewritten as querystring are getting encoded. This needs to be handled in the isQsappend code block.
Comment 32 Santhana Preethi 2016-09-09 18:27:12 UTC
Created attachment 34226 [details]
QueryString Append Failure
Comment 33 Mark Thomas 2016-09-12 16:02:11 UTC
It appears that httpd (at least as provided by 2.4.7 in Ubuntu 14.04.5 which I am using to test) sometimes double encodes when QSA is used as well.

I've reworked the QSA handling, added your test case, fixed another test case that was broken by this fix and made sure your name is in the changelog.

Thanks for sticking with this. Let us know how you get on. I'd like to get this all sported out in time for the next release scheduled for the end of the month.
Comment 34 Santhana Preethi 2016-09-14 11:32:34 UTC
Thanks for the update. I am testing the latest changes. So far I found a failure when there is a rule with QSA but no querystring is supplied. It is probably due to the absence of null check in QSA handling code.

I will let you know after testing more thoroughly.
Comment 35 Santhana Preethi 2016-09-14 11:35:52 UTC
Created attachment 34247 [details]
failure case for QSA
Comment 36 Santhana Preethi 2016-09-20 05:16:20 UTC
Any update in QSA issue?
Comment 37 Tiago Oliveira 2016-09-20 18:05:01 UTC
Created attachment 34281 [details]
fix for failure case (QSA)


this is a small fix for the test "failure case for QSA (attachment 34247 [details])" provided by Santhana Preethi (comment 35)
Comment 38 Mark Thomas 2016-09-20 19:21:49 UTC
Comment on attachment 34281 [details]
fix for failure case (QSA)

Thanks variation of the patch applied.
Comment 39 Santhana Preethi 2016-09-21 12:54:59 UTC
Thanks for the fix.
Comment 40 Mark Thomas 2016-09-21 13:35:02 UTC
This has been fixed in the following branches:
- 9.0.x for 9.0.0.M11 onwards
- 8.5.x for 8.5.6 onwards
- 8.0.x for 8.0.38 onwards