Bug 24333

Summary: Error 403 - when URL with a german umlaut is used
Product: Apache httpd-2 Reporter: Harald Schwarz <haschwar>
Component: Core Assignee: Apache HTTPD Bugs Mailing List <bugs>
Severity: normal Keywords: ErrorMessage, RFC
Priority: P3    
Version: 2.0.48   
Target Milestone: ---   
Hardware: PC   
OS: All   

Description Harald Schwarz 2003-11-02 19:19:22 UTC
When a page's file name contains a German umlaut, Apache 2.0.x is not able to
serve this page. You get the error message:
403 - access denied.

OS: Windows NT, 2000, probably XP (English and German OS).
    I don't know if this problem is also seen on non-Windows OS.
Tested Apache versions: 2.0.42 - 2.0.48
How to test: Rename an existing HTML file in your DocumentRoot to "ä.html".
Open a browser on your server PC (neither of the following variants works):
Comment 1 Harald Schwarz 2003-11-07 12:08:13 UTC
I searched the RFCs for rules on using national characters such as ä, ö, ü in
URLs, and on whether it is allowed to encode them in an HTTP URL.

I found that there could be a problem in Apache2 with respect to the RFC
specifications, but I did not find anything stating whether nationally encoded
characters "can" or "must not" be used in HTTP URLs.

RFC2396 [7] 2.1
Summary: Special Characters that are not represented in the octet for the
US-ASCII code require some way of identifying the charset used.

Part of the referenced RFC2277 [2] 3.1:
"All protocols MUST identify, for all character data, which charset is
in use."

Even if there is no problem regarding the RFC specifications, look at the error
message:
%E4.html - access denied -> wrong message
%E4.html - not found     -> could be the correct message if
                            national characters are not allowed.
"access denied" also occurs if ä.html does not exist.
Compared with Apache1: Apache1 is able to send %E4.html back to the browser.
Comment 2 André Malo 2003-11-07 12:21:33 UTC
It's a problem on Windows. If you use non-ASCII characters in URLs, you have to
encode them as UTF-8 and then apply URL-encoding. %E4 will be mapped to 'ä' and
not match the requirements of a Unicode filesystem.
BTW: It's recommended to use URL-encoded UTF-8 all the time in URLs, since there
is no way to declare the charset in URLs.
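
The two-step encoding (UTF-8 first, then URL-encoding) can be sketched in
Python. This is only an illustration using the standard urllib.parse module;
the variable names are made up for the example:

```python
from urllib.parse import quote

name = "ä.html"

# Step 1 + 2: encode the name as UTF-8, then percent-encode each octet.
utf8_url = quote(name, encoding="utf-8")

# For comparison: percent-encoding the ISO-8859-1 octets gives the %E4 form.
latin1_url = quote(name, encoding="latin-1")

print(utf8_url)    # %C3%A4.html
print(latin1_url)  # %E4.html
```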

The 403 comes from the URI -> filename translator (since there is no valid
filename, is it not found or is it forbidden? ;-). Newer versions of Apache
should write an entry to the error log for this failure.
Comment 3 Harald Schwarz 2003-11-10 09:18:16 UTC
You told me:
%E4 will be mapped to 'ä' and not match the requirements of an unicode filesystem.

But in the Unicode character set 'ä' is a valid character and is mapped to %E4.

You asked about the error message:
'access denied' is always returned by Apache 2 when requesting a URL containing
%E4. It doesn't matter whether the file exists or not.
Access to the file isn't forbidden: as I described, Apache 1 is able to
serve it (on the same server, with system rights).

Using loglevel 'debug' or loglevel 'emerg', I can only find the following entry
regarding the problem:
access.log:  ... "GET /%E4.html HTTP/1.1" 403 1153
error.log:   -

Perhaps it's the same problem described in bug 15133 - but this belongs to Axis.
Comment 4 André Malo 2003-11-10 09:53:43 UTC
I also said that it's recommended to use URL-encoded UTF-8 all the time in
URLs, since there is no way to declare the charset in URLs.

%E4 is NOT URL-encoded UTF-8. %C3%A4 is.
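
A quick way to check this claim is to undo the URL-encoding and try to decode
the resulting octets as UTF-8. The helper below is a hypothetical sketch in
Python (its name is made up for this example; it is not part of Apache):

```python
from urllib.parse import unquote_to_bytes


def is_url_encoded_utf8(path: str) -> bool:
    """Return True if the percent-decoded octets form valid UTF-8."""
    raw = unquote_to_bytes(path)  # "%E4" -> b"\xe4", "%C3%A4" -> b"\xc3\xa4"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_url_encoded_utf8("/%E4.html"))     # False
print(is_url_encoded_utf8("/%C3%A4.html"))  # True
```

The lone octet 0xE4 announces a three-octet UTF-8 sequence, but the next octet
('.') is not a continuation octet, so decoding fails.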
Comment 5 Harald Schwarz 2003-11-10 12:16:05 UTC
Ok, but we have to find a solution that works on both Apache versions.

Apache1:
.../%E4.html    -> ä.html is displayed
.../%C3%A4.html -> ä.html not found

Apache2:
.../%E4.html    -> access denied
.../%C3%A4.html -> ä.html is displayed

For 'encoding' tests, I used IE 5.5, IE6 and Mozilla 1.5 on Win32:
IE:      ä.html    -> \xE4.html
Mozilla: ä.html    -> %E4.html
Apache1: %E4.html  -> ä.html
         \xE4.html -> ä.html
Apache2: %E4.html  -> access denied
         \xE4.html -> access denied

Apache2 wants UTF-8 encoding, which no current, widely used browser supports in
its initial configuration.
It would be a lot easier if Apache2 supported Unicode the way Apache1 and the
current browsers do. Without Unicode support, Apache2 is impossible to use for
German sites, especially with more than 100 web authors who don't know about
this problem.
Comment 6 André Malo 2003-11-12 21:54:50 UTC
I don't understand what you expect for URLs containing non-ASCII characters.
Such URLs are invalid, so the behaviour of the client and the server is
_undefined_ (read: anything can happen).
Comment 7 Harald Schwarz 2003-11-13 07:57:05 UTC
>> I don't understand, what you expect...

What I expect when using an 'invalid' character like %E4 in a URL:
Not an 'undefined' reaction of the server, but
one of the following solutions:

- get a correct error message
  Not:    'access denied'
  Better: 'File not found' or 'invalid URL'.

- add unicode-support to Apache2.
  Either serve UTF-8 and Unicode files every time (like MS-IIS)
  or set a switch in a conf-file to specify which of both methods to use:
  special_character_handling = UTF-8 | Unicode | Both
  to be compatible with Apache1 and the various browsers.
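
The proposed switch could behave roughly like the Python sketch below. It is
purely hypothetical: decode_request_path and its policy values are made up to
mirror the proposal above, they are not an existing Apache directive, and
"Unicode" is taken here to mean the single-octet ISO-8859-1 form (%E4).

```python
from urllib.parse import unquote_to_bytes


def decode_request_path(path: str, policy: str = "Both") -> str:
    """Map a percent-encoded URL path to a file name (hypothetical policy)."""
    raw = unquote_to_bytes(path)
    if policy == "UTF-8":
        return raw.decode("utf-8")    # raises UnicodeDecodeError for %E4
    if policy == "Unicode":
        return raw.decode("latin-1")  # assumption: single-octet ISO-8859-1
    if policy == "Both":
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("latin-1")
    raise ValueError(f"unknown policy: {policy!r}")


print(decode_request_path("/%E4.html"))     # /ä.html
print(decode_request_path("/%C3%A4.html"))  # /ä.html
```

Trying UTF-8 first is only a heuristic: a file that really is named "Ã¤.html"
in ISO-8859-1 has the octets C3 A4 and would be resolved as "ä.html" under
"Both".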

Comment 8 André Malo 2003-11-13 09:56:35 UTC
%E4 is not an invalid character. %E4 is a sequence of 3 characters representing
the octet with the value 0xE4. No charset information is given. Apache 2, which
does support Unicode on Windows, cannot map it to one charset. A correct
representation of the German a-umlaut is %C3%A4 (as already said). This sequence
represents two octets which are the valid UTF-8 encoding of the German a-umlaut
in Unicode. The underlying filesystem (NTFS) can handle it. That's it.
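
The ambiguity of a bare octet can be shown with a short Python sketch. The
charsets listed are an arbitrary selection for illustration, not something
Apache consults:

```python
from urllib.parse import unquote_to_bytes

octets = unquote_to_bytes("%E4")  # three characters -> one octet, b"\xe4"

# The same octet is a different character in each single-octet charset.
candidates = {
    charset: octets.decode(charset)
    for charset in ("latin-1", "cp1251", "iso-8859-7")
}
for charset, char in candidates.items():
    print(charset, char)

# The UTF-8 form carries two octets that decode unambiguously as UTF-8.
print(unquote_to_bytes("%C3%A4").decode("utf-8"))  # ä
```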

Your distinction between UTF-8 and Unicode makes no sense, because UTF-8 is a
representation of Unicode.
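
The difference between a Unicode code point and its encoded octets can be seen
in a few lines of Python (a minimal illustration; the variable names are
arbitrary):

```python
ch = "ä"

code_point = ord(ch)          # the Unicode code point, U+00E4
utf8 = ch.encode("utf-8")     # UTF-8 representation: two octets
latin1 = ch.encode("latin-1") # ISO-8859-1 representation: one octet

print(hex(code_point))  # 0xe4
print(utf8)             # b'\xc3\xa4'
print(latin1)           # b'\xe4'
```

The code point number 0xE4 happens to equal the ISO-8859-1 octet, which is why
%E4 looks like "Unicode", but it is not the UTF-8 representation of 'ä'.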