Bug 34985

Summary: utf8 to ucs2 conversion failed on Windows
Product: Apache httpd-2
Reporter: ernesto <ernestoname>
Component: mod_cgi
Assignee: Apache HTTPD Bugs Mailing List <bugs>
Severity: normal
CC: gazerro, rd9
Priority: P2    
Version: 2.1-HEAD   
Target Milestone: ---   
Hardware: PC   
OS: Windows XP   

Description ernesto 2005-05-20 15:12:07 UTC
[Thu May 19 13:38:00 2005] [error] [client] (22)Invalid argument: 
couldn't spawn child process: C:/php/php.exe, referer: 
(22)Invalid argument: utf8 to ucs2 conversion failed on this string: 
Comment 1 William A. Rowe Jr. 2005-08-29 22:16:57 UTC
  Dude - if you are running mod_cgid on Win32 then all bets are off :)

  And I'm totally clueless, but I guess my first question is why use php.exe
  as a CGI when you can plug it in as a module, and actually serve pages without
  warming up your cpu?  CGI is a very disk/cpu/kernel intensive way to serve
  any content whatsoever.
Comment 2 Richard D 2006-11-04 02:15:56 UTC
This looks like a variant of Bug 32730 which had the same issues on Windows with
some different environment variables.  The problem is that Apache tries to
translate every environment variable from Unicode's UTF-8 encoding into UCS-2,
even though the environment variable may be in another character encoding (e.g.
ISO-8859-1 aka Latin-1).
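
The mismatch described above can be reproduced outside Apache. A minimal Python sketch, assuming a made-up variable value (it is not taken from any log in this report) and using Python's strict UTF-8 decoder as a stand-in for Apache's converter, which it does not call:

```python
# Illustrative bytes only: the same text ("caf" + e-acute) in two encodings.
latin1_value = b"caf\xe9"      # ISO-8859-1: e-acute is the single byte 0xE9
utf8_value = b"caf\xc3\xa9"    # UTF-8: e-acute is the two bytes 0xC3 0xA9


def is_strict_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 without errors."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_strict_utf8(utf8_value))    # True
print(is_strict_utf8(latin1_value))  # False: 0xE9 opens a multi-byte sequence that never completes
```

Any converter that assumes its input is UTF-8 will hit the same wall on the Latin-1 bytes, which is consistent with the "conversion failed" error in the description.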

An extension of the fix for Bug 32730 should work, although the real solution

This is not specific to mod_cgi and PHP, as it happens with non-PHP CGI
programs. CGI is still a reasonable option in some cases, e.g. for development
of CGI scripts on Windows for installation on Linux+CGI (or a production
mod_perl server on any OS).
Comment 3 Richard D 2006-11-04 02:22:23 UTC
Got interrupted when writing last comment, sorry...  

To finish the incomplete sentence in that comment: the real solution in my view
is to go through all environment variables that could be non-UTF8 (virtually
anything that is a string) and avoid converting those - or, better, only convert
those guaranteed not to be strings, or guaranteed to be ASCII only.  Another
environment variable with this problem is REDIRECT_URL, logged in a comment on Bug
32730 after the fix was committed.  This is a fairly simple extension of the patch I
submitted for that bug.

A configuration directive to turn off this conversion might also be useful.
Comment 4 Richard D 2006-11-04 02:58:21 UTC
Some more variants of this bug...

Bug 13029 is another variant for the environment variable SSL_SERVER_S_DN_L.  I
think the fundamental issue is that there's no way to turn off this UTF-8 to
UCS-2 conversion, and it only happens on Windows, well before any CGI script or
other code has a chance to do its own non-UTF-8 based conversion.  

The REDIRECT_QUERY_STRING variant was also reported at

Comment 5 William A. Rowe Jr. 2006-11-04 09:46:25 UTC
Yes - it looks like this needs to be more tolerant, overall, of non-utf8
data, and I'll look at rolling in a solution that doesn't impact security.

Thanks for your observations, they appear spot-on.
Comment 6 Richard D 2006-11-04 12:14:04 UTC
Not sure what you mean by security implications, but I don't think that falling
back to another encoding such as ISO-8859-1 is necessary.

Taking TWiki as an example, which uses paths like /bin/view/Main/WebHome, where
view is the CGI script, and /Main/WebHome is the PATH_INFO (see
for example of CGI environment variables), it would be useful to specify the
following to handle non-UTF-8 encodings such as ISO-8859-1 (which are used by
POST from Firefox currently):

PATH             Convert (since it has pathnames)
QUERY_STRING     Raw (not a filename, should be interpreted by application)
REQUEST_URI      Convert if valid UTF-8 (and not overlong encoding)
SCRIPT_FILENAME  Convert if valid UTF-8 (and not overlong encoding)
SCRIPT_NAME      Convert if valid UTF-8 (and not overlong encoding)
(rest are all raw)

Basically, only those variables that correspond to filenames should be
converted, and then only if they are valid UTF-8 without overlong encoding.

Any variables not used by Apache should not be converted, but left to the
application, or a suitable add-on Apache module for conversion.
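
The per-variable proposal above could be sketched roughly as follows. This is a Python illustration, not Apache code: the policy names, the POLICY table and should_convert() are all hypothetical, and Python's strict UTF-8 decoder (which rejects overlong encodings) stands in for the validity check.

```python
# Hypothetical policy names mirroring the proposal; not an Apache API.
CONVERT = "convert"
CONVERT_IF_UTF8 = "convert-if-valid-utf8"
RAW = "raw"

POLICY = {
    "PATH": CONVERT,
    "QUERY_STRING": RAW,
    "REQUEST_URI": CONVERT_IF_UTF8,
    "SCRIPT_FILENAME": CONVERT_IF_UTF8,
    "SCRIPT_NAME": CONVERT_IF_UTF8,
}


def is_valid_utf8(raw: bytes) -> bool:
    """Strict check; Python's decoder also rejects overlong forms."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def should_convert(name: str, raw: bytes) -> bool:
    """Decide whether a variable's value should be converted at all."""
    policy = POLICY.get(name, RAW)  # "rest are all raw"
    if policy == CONVERT:
        return True
    if policy == CONVERT_IF_UTF8:
        return is_valid_utf8(raw)
    return False


print(should_convert("SCRIPT_NAME", b"/bin/view"))           # True
print(should_convert("SCRIPT_NAME", b"/bin/\xc0\xafview"))   # False: overlong encoding of "/"
print(should_convert("PATH_INFO", b"/Main/WebHome"))         # False: not in the table, so raw
```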

TWiki has done its own interpretation of UTF-8 URLs, independent of the OS it is
running on, which is based on a technique used by IBM's web server for mainframe
(z/OS) - basically it tries to recognise the URL as UTF-8 and then falls back to
the native encoding (i.e. no conversion done at all).  In fact we do this on the
PATH_INFO ourselves.
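
The "try UTF-8, then fall back" idea can be sketched in a few lines. This is a Python illustration only, not TWiki's own code; the function name is hypothetical, and ISO-8859-1 is assumed as the native encoding for the example.

```python
def decode_with_fallback(raw: bytes, native: str = "iso-8859-1") -> str:
    """Treat raw as UTF-8 if it validates, otherwise as the native encoding."""
    try:
        return raw.decode("utf-8")   # strict: invalid or overlong input raises
    except UnicodeDecodeError:
        return raw.decode(native)    # assumed site encoding, not auto-detected


# The same page name sent as UTF-8 and as ISO-8859-1 ends up as the same text.
print(decode_with_fallback(b"/Main/Caf\xc3\xa9") == decode_with_fallback(b"/Main/Caf\xe9"))  # True
```

The approach relies on non-ASCII Latin-1 text rarely happening to form valid UTF-8, which is why detecting UTF-8 is so much easier than detecting character sets in general.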

If Apache is going to carry on doing its own UTF-8 to UCS-2 conversion, which I
suppose it must do in some cases that map onto a Windows filesystem (and others
such as MacOS X HFS+ etc), it would be good if it recognises when data is really
UTF-8 in this way.  Also, it would be very helpful to have a configuration
option that lets you say "don't convert variable X if it matches regex Y", e.g.
don't convert PATH_INFO if it matches "/twiki/bin/.*"
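
A rough sketch of how such an exemption rule might be evaluated, in Python. The option is only a suggestion in this comment, so the rule format, SKIP_RULES and conversion_skipped() are all hypothetical; the pattern is the one from the example above.

```python
import re

# Hypothetical rule list: (variable name, pattern the value must match).
SKIP_RULES = [
    ("PATH_INFO", re.compile(r"/twiki/bin/.*")),
]


def conversion_skipped(name: str, value: str) -> bool:
    """Return True if some rule says this variable should not be converted."""
    return any(
        rule_name == name and pattern.fullmatch(value) is not None
        for rule_name, pattern in SKIP_RULES
    )


print(conversion_skipped("PATH_INFO", "/twiki/bin/view/Main/WebHome"))  # True
print(conversion_skipped("PATH_INFO", "/cgi-bin/other"))                # False
```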

Some TWiki pages that might be of interest here are:

http://twiki.org/cgi-bin/view/Codev/EncodeURLsWithUTF8 - how TWiki does
auto-detection and conversion of UTF-8 encoding for PATH_INFO in URLs

http://twiki.org/cgi-bin/view/Codev/InternationalisationUTF8 - includes material
on character set auto-detection including excerpt on IBM web server approach -
fortunately UTF-8 detection is much easier than the general case.

http://twiki.org/cgi-bin/view/Codev/MacOSXFilesystemEncodingWithI18N - talks
about a filesystem-related issue with Unicode normalisation forms on Mac OS X 

http://twiki.org/cgi-bin/view/Codev/ProposedUTF8SupportForI18N - general page
summarising research on UTF-8 for TWiki, including some useful links

Comment 7 Preben Nilsson 2007-03-05 02:21:09 UTC
Hi all,

We are implementing an application that uses SSL client certificates, and it
seems like we are running into the same problem that is described here:

[Mon Mar 05 09:48:34 2007] [error] [client] File does not exist: 
(22)Invalid argument: utf8 to ucs2 conversion failed on this string: 
SSL_CLIENT_S_DN_CN=Anette Birgitte Franzp\xf8tter

Is there a way I can work around this problem?
Best regards
Preben Nilsson
Comment 8 William A. Rowe Jr. 2007-12-22 13:06:29 UTC

*** This bug has been marked as a duplicate of 13029 ***