Bug 34985 - utf8 to ucs2 conversion failed on Windows
Status: RESOLVED DUPLICATE of bug 13029
Alias: None
Product: Apache httpd-2
Classification: Unclassified
Component: mod_cgi
Version: 2.1-HEAD
Hardware: PC Windows XP
Importance: P2 normal with 5 votes
Target Milestone: ---
Assignee: Apache HTTPD Bugs Mailing List
Reported: 2005-05-20 15:12 UTC by ernesto
Modified: 2007-12-22 13:06 UTC
Description ernesto 2005-05-20 15:12:07 UTC
[Thu May 19 13:38:00 2005] [error] [client 10.1.5.91] (22)Invalid argument: couldn't spawn child process: C:/php/php.exe, referer: http://digei06/Seige/InformacionRelevante/Cuadros.phtml?Dependencia=Direccion%20del%20SEIGE&url=http://seplader2.qroo.gob.mx/seige
(22)Invalid argument: utf8 to ucs2 conversion failed on this string: REDIRECT_QUERY_STRING=Sector=Seguridad%20y%20Orden%20P\xfablico
Comment 1 William A. Rowe Jr. 2005-08-29 22:16:57 UTC
  Dude - if you are running mod_cgid on Win32 then all bets are off :)
  Reclassifying.

  And I'm totally clueless, but I guess my first question is why use php.exe
  as a CGI when you can plug it in as a module, and actually serve pages without
  warming up your cpu?  CGI is a very disk/cpu/kernel intensive way to serve
  any content whatsoever.
Comment 2 Richard D 2006-11-04 02:15:56 UTC
This looks like a variant of Bug 32730 which had the same issues on Windows with
some different environment variables.  The problem is that Apache tries to
translate every environment variable from Unicode's UTF-8 encoding into UCS-2,
even though the environment variable may be in another character encoding (e.g.
ISO-8859-1 aka Latin-1).
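As an illustration of the failure described above, the following minimal sketch uses Python as a stand-in for the server's conversion step; it is not Apache's code. It assumes the `\xfa` in the logged string is the escaped form of a single byte 0xFA, and the helper name `utf8_to_ucs2` is made up for this example.

```python
# Illustrative only: Python stands in for the server's UTF-8 -> UCS-2 step.
# Bytes as they appear (escaped) in the logged REDIRECT_QUERY_STRING value.
raw = b"Sector=Seguridad%20y%20Orden%20P\xfablico"


def utf8_to_ucs2(data: bytes) -> bytes:
    """Strictly decode as UTF-8, then re-encode as UTF-16LE.

    For text within the Basic Multilingual Plane, UTF-16LE and UCS-2
    produce the same bytes, so this approximates the conversion.
    """
    return data.decode("utf-8").encode("utf-16-le")


try:
    utf8_to_ucs2(raw)
    conversion_failed = False
except UnicodeDecodeError:
    # 0xFA cannot start a valid UTF-8 sequence, so strict decoding fails.
    conversion_failed = True

# The same bytes are perfectly meaningful when read as ISO-8859-1 (Latin-1).
latin1_text = raw.decode("iso-8859-1")
```

The same word succeeds if it arrives as real UTF-8 (where the accented letter is the two bytes 0xC3 0xBA), which is why the failure depends on what encoding the client used rather than on the variable itself.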

An extension of the fix for Bug 32730 should work, although the real solution

This is not specific to mod_cgi and PHP, as it happens with non-PHP CGI
programs. CGI is still a reasonable option in some cases, e.g. for development
of CGI scripts on Windows for installation on Linux+CGI (or a production
mod_perl server on any OS).
Comment 3 Richard D 2006-11-04 02:22:23 UTC
Got interrupted when writing last comment, sorry...  

To finish the incomplete sentence in that comment: the real solution in my view
is to go through all environment variables that could be non-UTF8 (virtually
anything that is a string) and avoid converting those - or, better, only convert
those guaranteed not to be strings, or guaranteed to be ASCII only.  Another
environment variable with this problem is REDIRECT_URL, logged in a comment on
Bug 32730 after the fix was committed.  This is a fairly simple extension of the
patch I submitted for that bug.

A configuration directive to turn off this conversion might also be useful.
Comment 4 Richard D 2006-11-04 02:58:21 UTC
Some more variants of this bug...

Bug 13029 is another variant for the environment variable SSL_SERVER_S_DN_L.  I
think the fundamental issue is that there's no way to turn off this UTF-8 to
UCS-2 conversion, and it only happens on Windows, well before any CGI script or
other code has a chance to do its own non-UTF-8 based conversion.  

The REDIRECT_QUERY_STRING variant was also reported at
http://mail-archives.apache.org/mod_mbox/httpd-users/200504.mbox/%3c006901c536e0$3dd72010$5d01250a@vdm%3e


Comment 5 William A. Rowe Jr. 2006-11-04 09:46:25 UTC
Yes - it looks like this needs to be more tolerant, overall, of non-utf8
data, and I'll look at rolling in a solution that doesn't impact security.

Thanks for your observations, they appear spot-on.
Comment 6 Richard D 2006-11-04 12:14:04 UTC
Not sure what you mean by security implications, but I don't think that falling
back to another encoding such as ISO-8859-1 is necessary.

Taking TWiki as an example, which uses paths like /bin/view/Main/WebHome, where
view is the CGI script, and /Main/WebHome is the PATH_INFO (see
http://twiki.org/cgi-bin/viewfile/Support/ApacheErrorsDuringEdit?rev=1.1;filename=testenv.htm
for example of CGI environment variables), it would be useful to specify the
following to handle non-UTF-8 encodings such as ISO-8859-1 (which are used by
POST from Firefox currently):

AUTH_TYPE             Raw
DOCUMENT_ROOT         Convert
GATEWAY_INTERFACE     Raw
HTTP_ACCEPT           Raw
HTTP_ACCEPT_CHARSET   Raw
HTTP_ACCEPT_ENCODING  Raw
HTTP_ACCEPT_LANGUAGE  Raw
HTTP_CONNECTION       Raw
HTTP_HOST             Raw
HTTP_KEEP_ALIVE       Raw
HTTP_USER_AGENT       Raw
PATH                  Convert (since it has pathnames)
QUERY_STRING          Raw (not a filename, should be interpreted by application)
REMOTE_ADDR           Raw
REMOTE_PORT           Raw
REMOTE_USER           Raw
REQUEST_METHOD        Raw
REQUEST_URI           Convert if valid UTF-8 (and not overlong encoding)
SCRIPT_FILENAME       Convert if valid UTF-8 (and not overlong encoding)
SCRIPT_NAME           Convert if valid UTF-8 (and not overlong encoding)
SERVER_ADDR           Raw
SERVER_ADMIN          Raw
....
(rest are all raw)

Basically, only those variables that correspond to filenames should be
converted, and then only if they are valid UTF-8 without overlong encoding.

Any variables not used by Apache should not be converted, but left to the
application, or a suitable add-on Apache module for conversion.
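The per-variable rules listed above could be sketched roughly as follows. This is only an illustration in Python of the proposed policy, not anything Apache implements; the names `Policy`, `POLICIES`, `is_valid_utf8` and `should_convert` are hypothetical. It relies on Python's strict UTF-8 decoder, which rejects overlong encodings.

```python
from enum import Enum


class Policy(Enum):
    RAW = "raw"                          # pass bytes through untouched
    CONVERT = "convert"                  # always convert
    CONVERT_IF_VALID_UTF8 = "if-valid"   # convert only well-formed UTF-8


# Entries taken from the list above; anything not listed defaults to RAW.
POLICIES = {
    "DOCUMENT_ROOT": Policy.CONVERT,
    "PATH": Policy.CONVERT,
    "REQUEST_URI": Policy.CONVERT_IF_VALID_UTF8,
    "SCRIPT_FILENAME": Policy.CONVERT_IF_VALID_UTF8,
    "SCRIPT_NAME": Policy.CONVERT_IF_VALID_UTF8,
}


def is_valid_utf8(value: bytes) -> bool:
    """True if value is well-formed UTF-8 (overlong forms are rejected)."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def should_convert(name: str, value: bytes) -> bool:
    """Decide whether a variable should go through UTF-8 -> UCS-2 conversion."""
    policy = POLICIES.get(name, Policy.RAW)
    if policy is Policy.CONVERT:
        return True
    if policy is Policy.CONVERT_IF_VALID_UTF8:
        return is_valid_utf8(value)
    return False
```

With rules like these, a Latin-1 QUERY_STRING would be left alone, and a filename-like variable would only be converted when its bytes really are UTF-8.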

TWiki has done its own interpretation of UTF-8 URLs, independent of the OS it is
running on, which is based on a technique used by IBM's web server for mainframe
(z/OS) - basically it tries to recognise the URL as UTF-8 and then falls back to
the native encoding (i.e. no conversion done at all).  In fact we do this on the
PATH_INFO ourselves.
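A rough sketch of that "try UTF-8, else fall back" idea, written in Python for illustration and not taken from TWiki's code. The function name `interpret_url_path` and the default native encoding are assumptions. The fallback here decodes with the assumed native encoding only so that both results can be compared as text; as described above, the fallback amounts to leaving the data unconverted.

```python
from typing import Tuple
from urllib.parse import unquote_to_bytes


def interpret_url_path(path: str, native: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, encoding_used) for a percent-encoded URL path.

    The path is first unescaped to raw bytes. If those bytes are
    well-formed UTF-8 they are treated as UTF-8; otherwise they are
    read in the (assumed) native encoding of the site.
    """
    raw = unquote_to_bytes(path)
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(native), native


# The same topic name sent by a UTF-8 client and by a Latin-1 client.
from_utf8_client = interpret_url_path("/Main/P%C3%BAblico")
from_latin1_client = interpret_url_path("/Main/P%FAblico")
```

Both calls yield the same text, with only the detected encoding differing. Pure ASCII paths are valid UTF-8 and come out identical either way.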

If Apache is going to carry on doing its own UTF-8 to UCS-2 conversion, which I
suppose it must do in some cases that map onto a Windows filesystem (and others
such as MacOS X HFS+ etc), it would be good if it recognises when data is really
UTF-8 in this way.  Also, it would be very helpful to have a configuration
option that lets you say "don't convert variable X if it matches regex Y", e.g.
don't convert PATH_INFO if it matches "/twiki/bin/.*"
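To make the suggested option concrete, here is a small Python sketch of such an exemption check. No such Apache directive exists as far as this report shows; the rule list `NO_CONVERT_RULES` and the function `is_exempt` are hypothetical names used only for illustration.

```python
import re

# Hypothetical rules: (variable name, pattern over the raw byte value).
# Byte patterns are used because the values may not be valid UTF-8.
NO_CONVERT_RULES = [
    ("PATH_INFO", re.compile(rb"/twiki/bin/.*")),
]


def is_exempt(name: str, value: bytes) -> bool:
    """True if the variable matches a 'don't convert' rule."""
    return any(
        rule_name == name and pattern.fullmatch(value) is not None
        for rule_name, pattern in NO_CONVERT_RULES
    )
```

A server applying rules like this would skip conversion for the matching variable and hand the original bytes to the CGI program.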

Some TWiki pages that might be of interest here are:

http://twiki.org/cgi-bin/view/Codev/EncodeURLsWithUTF8 - how TWiki does
auto-detection and conversion of UTF-8 encoding for PATH_INFO in URLs

http://twiki.org/cgi-bin/view/Codev/InternationalisationUTF8 - includes material
on character set auto-detection including excerpt on IBM web server approach -
fortunately UTF-8 detection is much easier than the general case.

http://twiki.org/cgi-bin/view/Codev/MacOSXFilesystemEncodingWithI18N - talks
about a filesystem-related issue with Unicode normalisation forms on Mac OS X 

http://twiki.org/cgi-bin/view/Codev/ProposedUTF8SupportForI18N - general page
summarising research on UTF-8 for TWiki, including some useful links





Comment 7 Preben Nilsson 2007-03-05 02:21:09 UTC
Hi all,

We are implementing an application that uses SSL client certificates, and it
seems we are running into the same problem that is described here:

[Mon Mar 05 09:48:34 2007] [error] [client 195.7.31.10] File does not exist: C:/bec_was/servletpif/apache2/docroots/errordocs
(22)Invalid argument: utf8 to ucs2 conversion failed on this string: SSL_CLIENT_S_DN_CN=Anette Birgitte Franzp\xf8tter

Is there a way that I can work around this problem?
Best regards
Preben Nilsson
Comment 8 William A. Rowe Jr. 2007-12-22 13:06:29 UTC

*** This bug has been marked as a duplicate of 13029 ***