Bug 53554

Summary: Wrong case for hexadecimal percent encoding [patch]
Product: Apache httpd-2
Reporter: Tim Starling <tstarling>
Component: mod_rewrite
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Status: NEW ---    
Severity: normal    
Priority: P2    
Version: 2.5-HEAD   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Use uppercase hexadecimal digits in mod_rewrite

Description Tim Starling 2012-07-16 23:55:18 UTC
Created attachment 29069
Use uppercase hexadecimal digits in mod_rewrite

Apache mod_rewrite encodes special characters using lowercase hexadecimal digits, for example Chráněná becomes Chr%c3%a1n%c4%9bn%c3%a1 instead of Chr%C3%A1n%C4%9Bn%C3%A1. The use of a non-canonical URL breaks our caching system. We can't use lowercase hexadecimal digits as our canonical URLs because no browser sends URLs like that, so the cache would be even more badly broken. Please use uppercase hexadecimal digits in URLs.
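
For illustration only, here is a minimal Python sketch of the two encodings described above. It is not the mod_rewrite code or the attached patch; `quote_lower` is a hypothetical helper written just for this comparison, while `urllib.parse.quote` is the standard-library encoder, which emits uppercase hex digits.

```python
from urllib.parse import quote, unquote


def quote_lower(text: str) -> str:
    """Hypothetical helper: percent-encode UTF-8 bytes using lowercase hex digits."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        # Leave ASCII letters, digits and the unreserved marks untouched.
        if ch.isascii() and (ch.isalnum() or ch in "-._~"):
            out.append(ch)
        else:
            out.append(f"%{byte:02x}")
    return "".join(out)


upper = quote("Chráněná")        # uppercase hex, the form browsers send
lower = quote_lower("Chráněná")  # lowercase hex, the form reported in this bug

print(upper)
print(lower)
# The two strings differ byte for byte, yet decode to the same text.
print(upper == lower, unquote(upper) == unquote(lower))
```

A cache that keys on the raw URL string treats these as two different entries, which is the breakage reported here.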
Comment 1 Christophe JAILLET 2012-09-30 06:26:35 UTC
RFC 1738, Uniform Resource Locators (URL)
(http://www.rfc-editor.org/rfc/rfc1738.txt), says:

>>>
2.2. URL Character Encoding Issues

[...]
In addition, octets may be encoded by a character triplet consisting
of the character "%" followed by the two hexadecimal digits (from
"0123456789ABCDEF") which forming the hexadecimal value of the octet.
(The characters "abcdef" may also be used in hexadecimal encodings.)
[...]

<<<


So, I guess that httpd is correct when encoding with lower case.


I left the report open, just in case, but I think that it should be marked as FIXED, WONTFIX.
Comment 2 Tim Starling 2012-10-01 04:54:06 UTC
(In reply to comment #1)
> So, I guess that httpd is correct when encoding with lower case.
> 
> 
> I left the report open, just in case, but I think that it should be marked
> as FIXED, WONTFIX.

I think the RFC is pretty clear about which encoding is preferred, and it's not the one httpd is using. You seem to be using a very loose definition of "correct". There are two ways of doing it: one is preferred, the other is idiosyncratic and breaks caching. It is a simple change and the patch is attached.
Comment 3 Wim Lewis 2013-03-19 23:15:22 UTC
Apache is not incorrect here; the cache is not performing its job as well as it could: a well-written cache would compare URLs more intelligently than just a simple string compare.

RFC 3986 does say that URI producers should use uppercase hexadecimal digits for percent-encodings, though, and many clients do have bugs like this one when it comes to comparing URLs, so I think it would be reasonable for Apache to change its behavior here. ("Be strict in what you produce, but liberal in what you accept", and all that.)

http://tools.ietf.org/html/rfc3986#section-6.2 has more discussion on URL comparison and normalization.
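
As a rough sketch of the case normalization discussed in that section, a cache could uppercase the hex digits of every percent-encoded triplet before comparing URLs. The function names below are hypothetical and this covers only the percent-encoding case step, not the other normalizations RFC 3986 describes.

```python
import re

# Matches one percent-encoded octet, e.g. "%c3" or "%C3".
_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_pct_case(url: str) -> str:
    """Uppercase the hex digits of each percent-encoded triplet; leave the rest alone."""
    return _PCT_TRIPLET.sub(lambda m: m.group(0).upper(), url)


def same_cache_key(a: str, b: str) -> bool:
    """Hypothetical cache comparison: equal after percent-encoding case normalization."""
    return normalize_pct_case(a) == normalize_pct_case(b)


print(normalize_pct_case("/wiki/Chr%c3%a1n%c4%9bn%c3%a1"))
print(same_cache_key("/wiki/Chr%c3%a1n%c4%9bn%c3%a1",
                     "/wiki/Chr%C3%A1n%C4%9Bn%C3%A1"))
```

Only the triplets are touched, so the case of ordinary path characters is preserved.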