When a page is named with a German umlaut, Apache 2.0.x is unable to serve it. You get the error message: 403 - access denied.
OS: Windows NT, 2000, probably XP (English and German OS). I don't know whether this problem also occurs on non-Windows OSes.
Tested Apache versions: 2.0.42 - 2.0.48
How to test: Rename an existing HTML file in your DocumentRoot to "ä.html". Open a browser on your server PC (neither of the following URLs works):
http://localhost/ä.html
http://localhost/%E4.html
I searched the RFCs for guidance on using national characters such as ä, ö, ü in URLs and on whether it is allowed to encode them in an HTTP URL. I found that there could be a problem in Apache2 regarding the RFC specification, but I did not find anything saying whether encoded national characters "can" or "must not" be used in HTTP URLs. RFC2396 [7] 2.1 Summary: Special characters that are not represented in the octets of the US-ASCII code require some way of identifying the charset used. Part of the referenced RFC2277 [2] 3.1: "All protocols MUST identify, for all character data, which charset is in use." Even if there is no problem regarding the RFC specification, the error message is misleading. %E4.html - access denied -> wrong message. %E4.html - not found -> could be the correct message if national characters are not allowed. "access denied" also occurs if ä.html does not exist. Compared with Apache1: Apache1 is able to send %E4.html back to the browser.
It's a problem on Windows. If you use non-ASCII characters in URLs, you have to encode them as UTF-8 and then apply URL-encoding. %E4 will be mapped to 'ä' and not match the requirements of a Unicode filesystem. BTW: It's recommended to use URL-encoded UTF-8 all the time in URLs, since there is no way to declare the charset in URLs. The 403 comes from the translator URI -> filename (since there is no valid filename, is it not found or is it forbidden? ;-). In newer versions of Apache it should write an entry into the error log for this failure.
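
The two-step encoding described above (UTF-8 first, then URL-encoding) can be illustrated with a minimal Python sketch. It uses only the standard library; the variable names are chosen for illustration and are not part of Apache or of the original report.

```python
from urllib.parse import quote

# "\u00e4" is the German a-umlaut (ä); written as an escape so the
# snippet does not depend on the encoding of the source file.
filename = "\u00e4.html"

# Step 1 + 2: encode as UTF-8, then percent-encode each non-ASCII octet.
# quote() uses UTF-8 by default.
utf8_path = "/" + quote(filename)

# For comparison: the same name percent-encoded from its ISO-8859-1 octet,
# which is what the browsers in this report sent.
latin1_path = "/" + quote(filename, encoding="latin-1")

print(utf8_path)    # /%C3%A4.html
print(latin1_path)  # /%E4.html
```

Only the first form carries octets that are valid UTF-8; the second is a single ISO-8859-1 octet with no charset declared.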
You told me: %E4 will be mapped to 'ä' and not match the requirements of a Unicode filesystem. But in the Unicode character set 'ä' is a valid character and is mapped to %E4. You asked about the error message: 'access denied' is always returned by Apache 2 when requesting a URL containing %E4. It doesn't matter whether the file exists or not. Access to the file isn't forbidden: as I described, Apache 1 is able to serve it (on the same server with system rights). Using loglevel 'debug' or loglevel 'emerg' I can only find the following entry regarding the problem: access.log: ... "GET /%E4.html HTTP/1.1" 403 1153 error.log: - Perhaps it's the same problem described in bug 15133, but that one belongs to Axis.
I also said that it's recommended to use URL-encoded UTF-8 all the time in URLs, since there is no way to declare the charset in URLs. %E4 is NOT URL-encoded UTF-8. %C3%A4 is.
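
One way to check this claim is to percent-decode each path to raw octets and test whether they form valid UTF-8. The helper name `is_utf8_path` below is hypothetical and exists only for this sketch; it says nothing about how Apache itself performs the check.

```python
from urllib.parse import unquote_to_bytes


def is_utf8_path(path: str) -> bool:
    """Return True if the percent-decoded octets of `path` are valid UTF-8."""
    try:
        unquote_to_bytes(path).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_utf8_path("/%C3%A4.html"))  # True
print(is_utf8_path("/%E4.html"))     # False
```

The octet 0xE4 announces a multi-byte UTF-8 sequence, but the next octet (`.`) is not a continuation byte, so decoding fails.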
OK, but we have to find a solution that works on both Apache versions.
Apache1:
  .../%E4.html -> ä.html is displayed
  .../%C3%A4.html -> ä.html not found
Apache2:
  .../%E4.html -> access denied
  .../%C3%A4.html -> ä.html is displayed
For the 'encoding' tests, I used IE 5.5, IE6 and Mozilla 1.5 on Win32:
IE: ä.html -> \xE4.html
Mozilla: ä.html -> %E4.html
Apache1:
  %E4.html -> ä.html
  \xE4.html -> ä.html
Apache2:
  %E4.html -> access denied
  \xE4.html -> access denied
Apache2 wants UTF-8 encoding; no current, widely used browser supports this in its initial configuration. It would be a lot easier if Apache2 supported Unicode the way Apache1 and the current browsers do. Without Unicode support Apache2 is impossible to use for German sites, especially with more than 100 web authors who don't know about this problem.
I don't understand what you expect for URLs containing non-ASCII characters. Such URLs are invalid, so the behaviour of the client and the server is _undefined_ (read: anything can happen).
>> I don't understand what you expect...
What I expect when using an 'invalid' character like %E4 in a URL: not an 'undefined' reaction of the server, but one of the following solutions:
- Get a correct error message. Not: 'access denied'. Better: 'File not found' or 'invalid URL'.
- Add Unicode support to Apache2. Either serve UTF-8 and Unicode files every time (like MS-IIS) or set a switch in a conf file to specify which of the two methods to use: special_character_handling = UTF-8 | Unicode | Both, to be compatible with Apache1 and the various browsers.
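
To make the proposed "Both" setting concrete, here is a minimal Python sketch of a fallback decoder: try UTF-8 first, and treat the octets as ISO-8859-1 only if that fails. This is an assumption about what such a switch could mean; `special_character_handling` is the commenter's suggestion, not an existing Apache directive, and `decode_request_path` is a hypothetical name.

```python
from urllib.parse import unquote_to_bytes


def decode_request_path(path: str) -> str:
    """Decode a percent-encoded path, preferring UTF-8 over ISO-8859-1."""
    raw = unquote_to_bytes(path)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume a legacy single-byte (ISO-8859-1) request.
        return raw.decode("latin-1")


# Both request forms resolve to the same file name.
print(decode_request_path("/%C3%A4.html") == decode_request_path("/%E4.html"))  # True
```

Such a fallback is a guess by design: an ISO-8859-1 name whose octets happen to be valid UTF-8 (for example %C3%A4 meant as two characters) would be read as UTF-8.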
%E4 is not an invalid character. %E4 is a sequence of 3 characters representing the octet with the value 0xe4. No charset information is given. Apache 2, which does support Unicode on Windows, cannot map it to any one charset. A correct representation of the German a-umlaut is %C3%A4 (as already said). This sequence represents two octets that are the valid UTF-8 encoding of the German a-umlaut in Unicode. The underlying filesystem (NTFS) can handle it. That's it. Your distinction between UTF-8 and Unicode makes no sense, because UTF-8 is a representation of Unicode.
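
The difference between a Unicode code point and its encoded octets, which is the source of the confusion around %E4, can be seen in a short Python sketch. The variable names are illustrative only.

```python
ch = "\u00e4"  # the German a-umlaut (ä)

# The Unicode code point is a number, not a byte sequence.
code_point = ord(ch)

# UTF-8 represents that code point as two octets ...
utf8_octets = ch.encode("utf-8")

# ... while ISO-8859-1 represents it as one octet, which happens to
# have the same numeric value as the code point.
latin1_octets = ch.encode("latin-1")

print(hex(code_point))  # 0xe4
print(utf8_octets)      # b'\xc3\xa4'
print(latin1_octets)    # b'\xe4'
```

So U+00E4 and the octet 0xE4 share a value, but only the two-octet form C3 A4 is the UTF-8 representation of that code point.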