Issue 25416

Summary: Bug in Encoding of Hebrew letters in filename
Product: porting Reporter: alan
Component: code Assignee: tino.rachui
Status: CLOSED FIXED QA Contact: issues@porting <issues>
Severity: Trivial    
Priority: P3 CC: asari, bjoern.zessack, issues, smokey.ardisson, xslf
Version: OOo 1.1 RC5   
Target Milestone: OOo 2.0   
Hardware: Mac   
OS: Mac OS X, all   
Issue Type: DEFECT Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description | Flags
Patches for sal/inc/osl/thread.h, sal/osl/unx/*.c, and sal/osl/unx/*.cxx | none
Patch just tested under Panther! | none
Hack in file.cxx no longer needed with osxlocale patch | none

Description alan 2004-02-12 15:03:39 UTC
In TextEdit, if we save a file with Hebrew letters in the name,
the Hebrew letters are encoded as two-byte characters. The first byte is
the hex number D7, and the second one is different for each character. 
        In OOo, things are different. OOo has its own file URLs, and in
this file URL, the encoding for the Hebrew letters is the same as in
TextEdit, namely 2-byte characters, the first of which is D7. So
far so good. However, before the file actually gets written, OOo converts
this URL to a pathname for the file system, and in the process, changes
the encoding. The Hebrew letters now become one-byte chars, and that byte
is not identical with the second byte of the other encoding. As a result,
I'm unable to save a Hebrew file. I get an error message that the path is not
found.
Example:
        The file URL :
	file:///Users/oleg/Documents/%D7%90%D7%9C%D7%9F.sxw
        gets converted to:
	/Users/oleg/Documents/\0xd0\0xdc\0xdf.sxw
        where \0xd0\0xdc\0xdf are three characters with the hex values d0,
dc, df.
    
        For now, we put in a kludge. Namely, in osl_openFile
(src/sal/osl/unx/file.c) before calling open(buffer, flags, mode), we check
buffer for Hebrew letters, and convert them back to the two-byte encoding. This
allows the file to be saved. However, we still get the error message "<filename>
not found", probably because this conversion has to take place in other parts of
the code as well. How do we really fix this bug?
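
The byte values in the example above are consistent with each Hebrew letter being decoded from UTF-8 to its Unicode code point and then narrowed to the low byte of that code point. This is an inference from the numbers in the report, not something stated in the thread, and the helper names below are hypothetical. A minimal sketch of the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Decode one 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) into a code point. */
static uint32_t decode_utf8_2byte(unsigned char lead, unsigned char trail)
{
    assert((lead & 0xE0) == 0xC0 && (trail & 0xC0) == 0x80);
    return ((uint32_t)(lead & 0x1F) << 6) | (uint32_t)(trail & 0x3F);
}

/* Keep only the low 8 bits of a code point, as a lossy narrowing
 * to a one-byte character would (assumed behaviour, for illustration). */
static unsigned char low_byte(uint32_t code_point)
{
    return (unsigned char)(code_point & 0xFF);
}
```

With these helpers, the pair D7 90 decodes to U+05D0, whose low byte is d0, and likewise D7 9C and D7 9F give dc and df, matching the path shown in the example.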
Comment 1 Martin Hollmichel 2004-02-20 11:13:02 UTC
mh->ayaniger: is this a MacOSX problem only ?

reassigned.
Comment 2 tino.rachui 2004-02-23 07:43:58 UTC
See also discussion (Alan Yaniger) on openoffice.porting.dev from 02/17/2004 for
additional information
Comment 3 alan 2004-03-03 12:46:27 UTC
Attached is a file with patches for *.c and *.cxx files in sal/osl/unx. There is
also a patch for sal/inc/osl/thread.h
Comment 4 alan 2004-03-03 12:49:25 UTC
Created attachment 13553 [details]
Patches for sal/inc/osl/thread.h, sal/osl/unx/*.c, and sal/osl/unx/*.cxx
Comment 5 terryt 2004-03-04 09:53:33 UTC
Interestingly I got a separate EMail today from Boris Reznik in Israel claiming 
that while my "Start OpenOffice.org" launcher did not support opening files 
with Hebrew letters in the name, the "CoooL" launcher did work. However Boris 
didn't state what version of OOo he was using - I have to suspect OOo 1.0.3GM. 
If OOo requires source changes to support Hebrew filenames, then I can't see 
how "CoooL" would be working. However it does seem this bug report is 
discussing SAVE rather than OPEN.
Comment 6 thorsten.ziehm 2004-03-12 13:17:27 UTC
Because of limited resources for OOo1.1.2 we decided to shift this task to OOo2.0.
Comment 7 tino.rachui 2004-05-10 10:45:45 UTC
Please have a look at #i28928#
Kind Regards,
Tino
Comment 8 alan 2004-05-10 13:59:26 UTC
Hi Tino,

I looked at the issue 28928, and it wasn't obvious to me how it was relevant to
this issue. Could you explain more fully?

Thanks,
Alan
Comment 9 tino.rachui 2004-05-10 20:14:28 UTC
Hi Alan,

well sal converts file names which it gets from the system to UTF8. Because no
encoding is linked to such a system file name sal uses the current thread text
encoding for the conversion. But some systems always use a specific encoding for
file names (UTF8 for instance as in the current case), in this case the sal
conversion fails as we saw. On the other hand we cannot patch
osl_getThreadTextEncoding to always deliver the encoding used at the file system
interface as this function is even used in cases which have nothing to do with
the aforementioned issues and where we want indeed the current thread text
encoding. That is why I propose introducing a pair of new functions to set and
get the encoding which will be used for file name to file URL conversion. In the
concrete case the getter would return UTF8, for instance.

HTH,
Tino
Comment 10 hdu@apache.org 2004-05-11 12:45:58 UTC
Tino, you probably wanted to mention issue 28982 instead of 28928 :-)
Comment 11 tino.rachui 2004-05-17 20:44:43 UTC
*** Issue 16281 has been marked as a duplicate of this issue. ***
Comment 12 sforbes 2004-05-18 05:00:18 UTC
Well, if a Linux bug was marked as a dup of a Mac bug, then the PLATFORM and OS
need to be changed to ALL
Comment 13 sforbes 2004-05-18 05:00:59 UTC
*** Issue 29224 has been marked as a duplicate of this issue. ***
Comment 14 pluby 2004-08-23 05:20:23 UTC
I agree with Tino's point that changing how osl_getThreadTextEncoding() works
will cause other things to break.

The better solution (and one that I have been using in released versions of
NeoOffice/J) is to #define osl_getThreadTextEncoding() RTL_TEXTENCODING_UTF8
when MACOSX is defined in the following sal/osl/unx files:

file.c
module.c
pipe.c
process.c
process_impl.cxx
profile.c
security.c
tempfile.c
uunxapi.cxx
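
A rough sketch of how such a per-file macro override behaves. All demo_ and DEMO_ names are stand-ins invented for illustration, not the real sal declarations, and the constant values are arbitrary. The point is only that a function-like macro defined in a source file replaces later calls in that file, while the function itself stays untouched for every other caller:

```c
/* Stand-ins for the sal encoding type and constants; values are arbitrary. */
typedef int demo_TextEncoding;
#define DEMO_TEXTENCODING_ASCII 1
#define DEMO_TEXTENCODING_UTF8  2

/* Stand-in for the real function: pretend the thread encoding is ASCII. */
static demo_TextEncoding demo_getThreadTextEncoding(void)
{
    return DEMO_TEXTENCODING_ASCII;
}

/* Pretend this file is being built for Mac OS X. */
#define DEMO_MACOSX 1

#ifdef DEMO_MACOSX
/* From here to the end of this file, calls expand to the constant. */
#define demo_getThreadTextEncoding() DEMO_TEXTENCODING_UTF8
#endif

/* Hypothetical caller inside the same file: it now sees UTF-8. */
static demo_TextEncoding encoding_for_paths(void)
{
    return demo_getThreadTextEncoding();
}
```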
Comment 15 tino.rachui 2004-08-26 07:28:50 UTC
Hi *,

a more concrete proposal: 

We would like to introduce two new functions in sal
osl_setFileSystemEncoding
osl_getFileSystemEncoding

These functions deliver the encoding which should be used for encoding/decoding
system paths to or from file URLs. For platforms which use a fixed
encoding these functions could deliver the required encoding, while on other
systems they could just call osl_getThreadTextEncoding to get an
encoding. In the desktop project there is some code which detects specific
desktop environments like Gnome, etc. This would be a good place to set the
file system encoding to be used, if necessary. Hopefully I can fix this bug before
OOo 2.0 beta. I will propose this sal extension on openoffice.interface-discuss too.
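
The proposal above could be sketched roughly as follows. This is not the sal implementation: the sketch_ and SKETCH_ names, the constant values, and the choice to have the setter return the previous value are all assumptions made for illustration, and the sketch ignores thread safety.

```c
/* Stand-ins for the sal encoding type and constants; values are arbitrary. */
typedef int sketch_TextEncoding;
#define SKETCH_ENC_DONTKNOW 0
#define SKETCH_ENC_ASCII    1
#define SKETCH_ENC_UTF8     2

/* Stand-in for osl_getThreadTextEncoding: pretend the thread uses ASCII. */
static sketch_TextEncoding sketch_getThreadTextEncoding(void)
{
    return SKETCH_ENC_ASCII;
}

/* DONTKNOW means nobody has chosen a file system encoding yet. */
static sketch_TextEncoding g_fileSystemEncoding = SKETCH_ENC_DONTKNOW;

/* Set the encoding used for path <-> file URL conversion.
 * Returns the previously stored value. */
static sketch_TextEncoding sketch_setFileSystemEncoding(sketch_TextEncoding enc)
{
    sketch_TextEncoding previous = g_fileSystemEncoding;
    g_fileSystemEncoding = enc;
    return previous;
}

/* Return the explicit file system encoding if one was set,
 * otherwise fall back to the thread text encoding. */
static sketch_TextEncoding sketch_getFileSystemEncoding(void)
{
    if (g_fileSystemEncoding != SKETCH_ENC_DONTKNOW)
        return g_fileSystemEncoding;
    return sketch_getThreadTextEncoding();
}
```

On a platform with a fixed file name encoding, startup code would call the setter once with the UTF-8 constant; everywhere else the getter simply mirrors the thread text encoding.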
Comment 16 tino.rachui 2004-10-26 11:36:16 UTC
Hi Alan,

I played a little bit with a Mac (though my Mac knowledge is very limited) in
order to investigate the problem with regards to this bug. To me it seems that
the problem has something to do with a "misconfigured" system. It would be nice
if some Mac gurus could verify my findings and maybe suggest some fixes which
might be more appropriate than the suggested fix to overwrite
osl_getThreadTextEncoding in the osl file system interface. It is known that osl
uses osl_getThreadTextEncoding in order to get an encoding used for converting
system paths to file URLs and vice versa. osl_getThreadTextEncoding will be
initialized by a function osl_getProcessLocale which calls a function
_imp_getProcessLocale (see osl/unx/nlsupport.c). This function basically looks
as follows:

void _imp_getProcessLocale(...)
{
    /* set the locale defined by the env vars */
    char* locale = setlocale( LC_CTYPE, "" );
    
    /* fallback to the current locale */
    if( NULL == locale )
        locale = setlocale( LC_CTYPE, NULL );

    /* return the LC_CTYPE locale */
    *ppLocale = _parse_locale( locale );
}

If the function fails to provide a valid locale the "C" locale will be used by
sal/osl (see _parse_locale in the same file).  It seems that under MacOS X the
"C" locale is always active no matter which language is configured which would
be a reasonable explanation for the problems on Hebrew systems. 
Does MacOS X have means to query the currently configured locale and wouldn't it
be more useful to implement osl_getProcessLocale Mac specific? I'm happily
willing to accept and integrate patches into sal. If there is no better patch
than the currently suggested one we can also take this one.

Kind Regards,
Tino
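
To illustrate why a bare "C" locale is a problem here, the sketch below pulls the codeset part out of a POSIX-style locale name. It assumes that the text encoding is chosen from that codeset part; the function name locale_codeset is hypothetical and is not part of sal. For "C" there is no codeset at all, so nothing points to UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the codeset part of a POSIX-style locale name
 * (language[_territory][.codeset][@modifier]) into out.
 * Writes an empty string when the name has no codeset, as with "C".
 * The result is truncated if out is too small.
 */
static void locale_codeset(const char *locale, char *out, size_t out_size)
{
    const char *dot;
    size_t len;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    if (locale == NULL)
        return;

    dot = strchr(locale, '.');
    if (dot == NULL)
        return;                      /* e.g. "C" or "en_US" */

    len = strcspn(dot + 1, "@");     /* stop before any @modifier */
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, dot + 1, len);
    out[len] = '\0';
}
```

The string returned by setlocale( LC_CTYPE, "" ) could be passed straight to this helper to see which codeset, if any, the environment actually provides.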
Comment 17 tino.rachui 2005-01-17 10:22:22 UTC
Because of limited resources deferred to OOo later.
Comment 18 tino.rachui 2005-06-23 08:19:42 UTC
Meanwhile I've got a Mac of my own and can pick up the problem. As described
already the problem is that a Mac specific way for detecting the system locale
is necessary. The Mac has its own API for this. It is necessary that a '.UTF-8'
will be appended to each returned locale e.g. 'en_US.UTF-8' because this part
will be used to determine which encoding shall be used for encoding/decoding
file names.
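
The suffix step described above can be sketched in plain C. This is not the attached patch: the function name with_utf8_codeset is hypothetical, the Mac-specific lookup of the locale itself is left out, and replacing an already present codeset or modifier (rather than appending after it) is a design choice made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write locale with a ".UTF-8" codeset into out, e.g. "en_US" becomes
 * "en_US.UTF-8". Any existing ".codeset" or "@modifier" is dropped first.
 * Returns 0 on success, -1 if out is too small.
 */
static int with_utf8_codeset(const char *locale, char *out, size_t out_size)
{
    static const char suffix[] = ".UTF-8";
    size_t base_len;

    assert(locale != NULL && out != NULL);

    base_len = strcspn(locale, ".@");          /* language[_territory] */
    if (base_len + sizeof suffix > out_size)   /* sizeof counts the NUL */
        return -1;

    memcpy(out, locale, base_len);
    memcpy(out + base_len, suffix, sizeof suffix);
    return 0;
}
```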
Comment 19 tino.rachui 2005-06-27 08:30:09 UTC
*** Issue 46963 has been marked as a duplicate of this issue. ***
Comment 20 tino.rachui 2005-06-27 08:35:33 UTC
Created attachment 27507 [details]
Patch just tested under Panther!
Comment 21 tino.rachui 2005-06-27 08:37:31 UTC
Created attachment 27508 [details]
Hack in file.cxx no longer needed with osxlocale patch
Comment 22 tino.rachui 2005-06-27 08:51:50 UTC
Platform -> 'Macintosh'
OS -> 'Mac OS X'
Comment 23 tino.rachui 2005-07-05 09:06:00 UTC
Fixed on cws macosx10
Comment 24 eric.bachard 2005-07-13 22:41:59 UTC
Verified with m112 / Mac OSX Tiger 
Comment 25 jjmckenzie 2005-07-14 02:04:46 UTC
Patches for 27 June will only work with OOo 1.9_m series, not with SRX645.
Appropriate patches for SRX645 are in macxjoin1153.
Comment 26 eric.bachard 2005-07-23 23:02:17 UTC
*** Issue 50503 has been marked as a duplicate of this issue. ***
Comment 27 maho.nakata 2005-07-26 02:41:25 UTC
thanks I can input Japanese with this patch:
verified with:
1.9m119/kinput2.macim
Comment 28 maho.nakata 2005-07-26 02:42:26 UTC
thanks I can input Japanese with this patch:
verified with:
1.9m119/kinput2.macim.
Comment 29 asari 2005-08-01 03:44:49 UTC
Compile fails for me, saying:

=============
Building project udkapi
=============
/sw/src/fink.build/openoffice.org-ja-1.9m121-50/udkapi/com/sun/star
mkout -- version: 1.4
idlc @/tmp/mkDmCDfr
Could not get Canonical Locale Identifier from AppleLanguages value!
Bus error
dmake:  Error code 138, while making '../../../unxmacxp.pro/misc/urd_css.don'
'---* tg_merge.mk *---'

ERROR: Error 65280 occurred while making
/sw/src/fink.build/openoffice.org-ja-1.9m121-50/udkapi/com/sun/star
dmake:  Error code 1, while making 'build_all'


In my environment (Tiger),

$ defaults read 'Apple Global Domain' AppleLanguages

returns:

The domain/default pair of (kCFPreferencesAnyApplication, AppleLanguages) does
not exist

Any help?
Comment 30 tino.rachui 2005-08-24 07:05:33 UTC
TRA: Verified on master -> ok. Closing issue.