Issue 69973 - filenames with non native encoding cannot be loaded
Summary: filenames with non native encoding cannot be loaded
Status: CONFIRMED
Alias: None
Product: General
Classification: Code
Component: ui
Version: OOo 2.0.2
Hardware: All Linux, all
Importance: P3 Trivial with 6 votes
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: oooqa
Duplicates: 90262 101322 108790 112293
Depends on:
Blocks:
 
Reported: 2006-09-29 11:08 UTC by maisonobe
Modified: 2017-05-20 10:55 UTC (History)
CC List: 10 users

See Also:
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Tar archive with bad filename (40.00 KB, application/x-tar)
2007-02-15 10:05 UTC, jakubsuchy
zip containing an empty document with three different names (15.29 KB, application/x-compressed)
2007-02-15 13:34 UTC, maisonobe
Illustrating screenshot (302.30 KB, image/png)
2007-04-26 20:46 UTC, kpalagin
Screenshot of console and file browser (53.58 KB, image/png)
2007-04-27 13:43 UTC, kpalagin
OO canot open files on the Mac OS X (4.90 MB, application/octet-stream)
2009-07-17 10:22 UTC, yogurtdanone

Description maisonobe 2006-09-29 11:08:42 UTC
When a file that comes from another OS has an accented character in its name,
stored in a different encoding from mine, I cannot open it in OpenOffice.org.

One way to reproduce this is to create a file with an 'é' character in its name
on a system using latin-1 or latin-9 encoding (with Word under Windows, for
example), put it in a zip archive, transfer the archive to a GNU/Linux
machine running Ubuntu with UTF-8 encoding, and extract the file from the archive.

The 'é' character in the filename is encoded in latin-1, which corresponds to the
single byte 0xe9. When OpenOffice.org attempts to open this file, I think it
recognizes the letter and, before opening the file, converts it to the correct
native encoding on the platform, i.e. the two bytes 0xc3 0xa9 in this
case, which is correct in UTF-8. However, this is NOT the name of the file, and
the file cannot be opened. This fails whether I give the name on the command
line or try to open the file from the GUI.
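
To make the byte-level difference concrete, here is a minimal Python sketch. The
name "accentué.odt" is only an illustrative example, and nothing here is OOo code:

```python
# Illustrative name only; any name containing 'é' behaves the same way.
name = "accentué.odt"

# Bytes as written by a latin-1 system (what ends up on disk after unzipping).
latin1_bytes = name.encode("latin-1")

# Bytes a UTF-8 locale would use for the same visible name.
utf8_bytes = name.encode("utf-8")

print(latin1_bytes)  # b'accentu\xe9.odt'
print(utf8_bytes)    # b'accentu\xc3\xa9.odt'

# To the kernel these are two unrelated file names.
names_differ = latin1_bytes != utf8_bytes
```

Asking the system to open the second byte string when only the first exists on
disk fails with "no such file", which matches the behaviour reported here.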

This seems somewhat related to issue 25416. I don't know if it is the same bug
reopening or if it is a side effect of the fix.
Comment 1 michael.ruess 2006-09-29 11:28:27 UTC
Framework issue.
Comment 2 Rainer Bielefeld 2007-02-03 08:41:05 UTC
@maisonobe 
Please attach a .zip containing a document with a non working name
Comment 3 jakubsuchy 2007-02-15 10:05:56 UTC
Created attachment 43029 [details]
Tar archive with bad filename
Comment 4 Rainer Bielefeld 2007-02-15 11:14:49 UTC
I checked with "2.2.0  Dev. Snapshot  WIN XP: [680m7(Build9118)]" and I checked
with "2.0.2  German version WIN XP: [680m5(Build9011)]" 

No problem to open the documents in the tar from OOo with OOo or WIN dialogue
and also from WIN explorer. LINUX specific?
Comment 5 maisonobe 2007-02-15 13:32:53 UTC
Rather than Linux or Windows specific, I would say it may be cross-platform
specific ;-) I was not able to open the file in the attachment provided by
jakubsuchy on my UTF-8 Ubuntu box.

The problem occurs when a file is transferred from one system to another, or when
the name is changed to one that is correct in a non-default encoding. I will
attach another zip with an empty document file under three different names. The
content of the three files is exactly the same; only the name is changed. One
name is in UTF-8, another is in Latin1 and the last is in some MAC
encoding (I think). I can open two of these files here (UTF-8 and MAC) but not
the third one (Latin1). The Latin1 file appears in the file selector and I can
select it, but opening fails with a message stating that the file does not exist.
On other systems, which files fail may differ; I cannot say.
I think the important thing is that the Latin1 file (in my case) *is* a
legitimate filename on some platforms, so I think OOo tries to be too smart and
some layer understands the encoding and converts it before opening the file.
Comment 6 maisonobe 2007-02-15 13:34:28 UTC
Created attachment 43035 [details]
zip containing an empty document with three different names
Comment 7 kpalagin 2007-04-26 19:28:00 UTC
Seems to be dup of http://www.openoffice.org/issues/show_bug.cgi?id=59251.
Resolving as such.

*** This issue has been marked as a duplicate of 59251 ***
Comment 8 tml 2007-04-26 19:45:29 UTC
If this happens on Linux, it can't be a duplicate of 59251, as that issue is
about a totally Windows-specific problem. But maybe several unrelated problem
scenarios are mixed up here, and one of them is the 59251 one, i.e. opening
documents by double-clicking in Explorer when the file name contains characters
not in the system codepage.
Comment 9 kpalagin 2007-04-26 20:34:52 UTC
Reopening as per tml.

Comment 10 kpalagin 2007-04-26 20:42:40 UTC
Confirming with 2.2 on Suse 10.2 KDE - one of the files (circled in green on 
attached screenshot) from "three-names.zip" could not be opened from desktop 
by clicking it. File - Open would not even show the file in the list (circled 
in red).
Comment 11 kpalagin 2007-04-26 20:46:00 UTC
Created attachment 44713 [details]
Illustrating screenshot
Comment 12 tml 2007-04-27 13:08:21 UTC
I think the problem here is that OOo doesn't treat file names on Unix as opaque
byte strings, but interprets them according to the codeset/encoding (not really
sure about the correct terminology here) of the user's current locale? This is
problematic as file names on Unix *are* just byte strings.

Any interpretation of file name byte strings as UTF-8 or something else is up to
user-level software. It's up to the user's carefulness and cluefulness, and
site policies, whether the actual file names present on a Unix file system are in
some consistent codeset/encoding or not. I assume it is very common that Western
European Unix installations, for example, have file names in both ISO8859-1 and
UTF-8.

The file names (byte strings) in the three-names.zip file are, with Perl-style
hex escape syntax:

1) accentu\x{c3}\x{a9}.odt
2) accentu\x{e9}.odt
3) accentu\x{e2}\x{88}\x{9a}\x{c2}\x{a9}.odt

The zip format apparently stores file names just as byte streams. According to
http://www.pkware.com/documents/casestudies/APPNOTE.TXT there are mechanisms to
indicate the codeset and encoding of the file names, but I don't know how well
those are implemented and adhered to by the actual zip implementations.

I haven't checked the code, but apparently OOo trusts that the codeset/encoding
the user's locale setting indicates really is enforced, and that all file names
encountered are legal in that codeset/encoding. It probably converts the
filenames from this encoding to its internal UTF-16 string format. File names on
disk that aren't legal in the locale's encoding are just skipped. 

In your SUSE case, apparently the encoding the locale indicates is in use is
UTF-8. Only the file names 1) and 3) above are legal UTF-8 strings. The file
name 2) presumably is in some single-byte codeset like CP1252. It is not legal
UTF-8, so OOo just skips it. How does the directory listing look in a shell
window, or a file browser window?
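
The claim that only names 1) and 3) are legal UTF-8 can be checked with a short
Python sketch. The helper name is_valid_utf8 is made up for this illustration;
the last two lines test one possible origin of name 3), namely UTF-8 bytes
misread as MacRoman and re-encoded, which is an assumption and not something
confirmed in this issue:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as UTF-8 without error."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The three names from three-names.zip, as raw bytes.
names = [
    b"accentu\xc3\xa9.odt",
    b"accentu\xe9.odt",
    b"accentu\xe2\x88\x9a\xc2\xa9.odt",
]
validity = [is_valid_utf8(n) for n in names]
print(validity)  # [True, False, True]

# Assumption: name 3 is what you get when the UTF-8 bytes of 'é' are
# interpreted as MacRoman and the result is encoded as UTF-8 again.
mac_misread = b"\xc3\xa9".decode("mac_roman").encode("utf-8")
print(mac_misread == b"\xe2\x88\x9a\xc2\xa9")  # True
```

A strict decoder in a UTF-8 locale has no representation for name 2), which
would fit it being dropped from the file list.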

Fixing this problem might be quite hard. It might also be argued that this is a
case of garbage in, garbage out. If the user can't keep track of using a
consistent encoding for her file names, why should OOo care ;)
Comment 13 kpalagin 2007-04-27 13:43:32 UTC
Created attachment 44734 [details]
Screenshot of console and file browser
Comment 14 kpalagin 2007-04-27 13:52:46 UTC
Screenshot is attached.
I think that trying to enforce naming rules in addition to those of the OS is 
wrong and will negatively affect our image - if the OS can handle the file, why 
should the app not be able to? Wasn't it you who quoted 
"be liberal in what you accept, strict in what you generate"? :D
Comment 15 maisonobe 2007-04-27 13:57:24 UTC
I agree with the fact the encoding is inconsistent with the locale setting, this
is exactly my point in fact. However, I think it is important to handle this
case transparently. The real life situation in which I encountered the case
(several times) is when I receive files created by co-workers (well, mainly my
boss) on their system in order to provide some content and send it back to them.

The file name may seem strange according to *my* system, but it is neither
forbidden by Unix file naming rules (as long as some special characters are
avoided) nor impossible to handle with other applications (mail, shell commands,
browsers ...).

I said in my first post that OOo tries to be too smart here. The file selection
widget succeeds in handling the name, as it can (sometimes) put it in the
selectable list and lets the user click on it. However, once this has
been done and the name is passed to some other part of the code that
will open it, it seems some sort of re-encoding or normalization layer is traversed
and a different name is provided to the "open" system call, which fails because
the re-encoded name does not correspond to any existing file.
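
The suspected failure mode can be simulated outside OOo. This is a minimal Python
sketch, assuming a Linux file system that accepts arbitrary bytes in names;
transcode_latin1_to_utf8 is a hypothetical stand-in for whatever layer OOo
actually uses, which has not been identified in this issue:

```python
import os
import tempfile

def transcode_latin1_to_utf8(raw_name: bytes) -> bytes:
    """Hypothetical stand-in for the suspected re-encoding layer."""
    return raw_name.decode("latin-1").encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)

    # Create a file whose name is Latin-1 encoded, as after unzipping.
    raw_name = b"accentu\xe9.odt"
    with open(os.path.join(tmp_bytes, raw_name), "wb") as f:
        f.write(b"demo")

    # The name as stored on disk is found.
    raw_exists = os.path.exists(os.path.join(tmp_bytes, raw_name))

    # The "corrected" name refers to a file that was never created.
    transcoded = transcode_latin1_to_utf8(raw_name)
    transcoded_exists = os.path.exists(os.path.join(tmp_bytes, transcoded))
```

On such a system raw_exists is true and transcoded_exists is false: the file can
be listed and selected, but the re-encoded name does not exist.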

So I see a file in a widget, click on it, and OOo says "no such file or
directory". This is a real problem, regardless of whether the name is badly
chosen or inconsistent with the locale. It is a valid name for the operating system
and it should not be transformed.

I understand that such transcoding layers are useful for display purposes, for
example to show a human-readable version of the name in the selection widget, but I
don't agree with a transcoding step being inserted between the selection and the
opening of the file.
Comment 16 thorsten.martens 2007-07-09 15:15:44 UTC
changed to enhancement and sent to requirements to be discussed how to handle
this problem.
Comment 17 kpalagin 2007-07-09 16:02:19 UTC
Thorsten,
I respectfully disagree with your assessment. This _is_ a defect, as we can't load a 
validly named file - essential functionality is broken (even if it has been broken 
since pre-1.0 times).
Comment 18 maisonobe 2007-07-09 16:22:52 UTC
I don't understand why this problem is now considered an enhancement request.
Since the badly-named file does exist on the file system and can be successfully
opened by all applications on my Linux box except OOo, I still consider it is a
problem.
It may end up as an enhancement request for some low level transcoding layer,
but at the application level at which I reported it, it is a defect.

Could you explain the rationale behind changing the issue type from defect
to enhancement?
Comment 19 kpalagin 2008-01-21 20:19:05 UTC
Thorsten,
please reconsider change from defect to enhancement.

Thanks.
Comment 20 Olaf Felka 2009-06-26 14:33:49 UTC
CCing tm and mba for further investigations
Comment 21 Raphael Bircher 2009-07-02 16:27:27 UTC
*** Issue 101322 has been marked as a duplicate of this issue. ***
Comment 22 Mathias_Bauer 2009-07-02 16:38:49 UTC
Sorry, but how on earth can one get the idea that "requirements" could be the
right owner of this issue? This clearly is a technical problem. Thanks to of for
making me aware of that issue.

As for the problem itself, I don't care if it is a defect or an enhancement.
There are good arguments for both. It seems that (according to tml) OOo's
treatment of file names is at fault in general here. So one can say that it
"works as designed" but the design is wrong here. 

First I want to understand the problem a little bit better. I think Mikhail is
the best owner of this issue. 
Comment 23 yogurtdanone 2009-07-17 10:22:11 UTC
Created attachment 63597 [details]
OO canot open files on the Mac OS X
Comment 24 yogurtdanone 2009-07-17 10:25:34 UTC
Guys, I attached video file.
My issue 101322 has been marked as a duplicate of this issue, so I think it can
help you to resolve it.
For me this problem is specific to Mac OS.
Comment 25 sobukus 2009-11-24 10:06:00 UTC
I've just been hit by this with OpenOffice 3.0.0 (I still have to upgrade to a 
more current release, but the state of this issue report leads me to suspect 
that the behaviour on file encodings won't have changed).

I tried to open a file from the shell, a file I copied from a FAT-formatted 
USB drive, that has a latin1-character in its name. I have UTF-8 locale on 
Unix, the file name looks funny in the shell -- but I can work with the file 
just fine, except with OpenOffice, which lies to me with a straight 
face: "This file does not exist."

People, this is a serious failure.  Call it bug or enhancement, if you please, 
but please, please with sugar on top, consider fixing it. This is your program 
simply refusing to do its basic work on a file that can be accessed by every 
other program on the system.

File names on UNIX just are byte streams. Encoding from the locale is only 
there for pretty-printing of these bytes. If OpenOffice starts interpreting 
and converting the file names according to some encoding rules, it is getting 
this basic fact plainly wrong and is broken by design.
The fix should not be that hard: Just do your encoding magic solely for 
display, but open the underlying file using the raw bytes you get from the 
file system.
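
A minimal Python sketch of that separation, assuming a Linux file system; the
function names list_for_display and open_selected are invented for this
illustration and say nothing about how OOo's file picker is structured:

```python
import os

def list_for_display(directory: bytes):
    """Return (raw_name, label) pairs for the entries of a directory.

    raw_name is the untouched byte string from the file system and is the
    only thing ever passed to open(); label is for the UI only.
    """
    entries = []
    for raw in sorted(os.listdir(directory)):  # bytes in, bytes out
        # Undecodable bytes become U+FFFD in the label; the raw name is kept.
        label = raw.decode("utf-8", errors="replace")
        entries.append((raw, label))
    return entries

def open_selected(directory: bytes, raw_name: bytes):
    """Open the entry the user picked, using the raw bytes, never the label."""
    return open(os.path.join(directory, raw_name), "rb")
```

The label may look odd for a Latin-1 name in a UTF-8 locale, but the open call
still reaches the right file because it never sees the decoded text.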
Comment 26 mikhail.voytenko 2009-11-24 11:00:57 UTC
Yes, of course, the conversion of the paths is problematic here. But the main
reason for the conversion seems to be the representation of system
paths as "file:" URLs, and the way the conversion is done.

The current design assumes that the encoding of the system path is
known, which is not really the case here. Thus the conversion does not
round-trip and we lose information.

It is not easy to change this currently, since the office is completely based on
URL handling. Working around the problem by transporting the original
system path through the API would do the trick, but it would also be a very complex
change.

We could try to change the internal file: URL concept in a way that would
allow the round-trip.
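
As an illustration of what a round-trip-safe file: URL could look like, here is
a Python sketch that percent-encodes the raw path bytes without decoding them
first. The helper names are hypothetical, and this is not a description of OOo's
URL code:

```python
from urllib.parse import quote, unquote_to_bytes

def path_to_file_url(raw_path: bytes) -> str:
    """Percent-encode the raw bytes directly; no charset is assumed."""
    return "file://" + quote(raw_path)

def file_url_to_path(url: str) -> bytes:
    """Recover exactly the bytes that were encoded."""
    prefix = "file://"
    if not url.startswith(prefix):
        raise ValueError("not a file: URL")
    return unquote_to_bytes(url[len(prefix):])

raw_path = b"/tmp/accentu\xe9.odt"
url = path_to_file_url(raw_path)
round_tripped = file_url_to_path(url)

print(url)                        # file:///tmp/accentu%E9.odt
print(round_tripped == raw_path)  # True
```

Because the URL carries the original bytes as %E9 rather than a transcoded
%C3%A9, converting back yields the name that actually exists on disk.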
Comment 27 Raphael Bircher 2010-02-02 18:46:11 UTC
*** Issue 108790 has been marked as a duplicate of this issue. ***
Comment 28 goc 2010-04-08 17:22:12 UTC
I'm having the same problem with OOo 3.1.1 on Linux/amd64. Someone sent me a
file from his Mac with this filename:

$ ls -l Homepage_Seite_1_�\ Robert\ Gortana\ -\ Fotolia.com.jpg |hexdump -c
0000000   -   r   w   -   -   -   -   -   -   -       1       t   o   n
0000010   i       s   t   a   f   f       1   7   3   7   5   1   5    
0000020   2   0   1   0   -   0   4   -   0   8       1   7   :   3   1
0000030       H   o   m   e   p   a   g   e   _   S   e   i   t   e   _
0000040   1   _ 251       R   o   b   e   r   t       G   o   r   t   a
0000050   n   a       -       F   o   t   o   l   i   a   .   c   o   m
0000060   .   j   p   g  \n                                            
0000065


In the "Open File" dialogue, I can see an icon for "regular file" (as opposed to
"directory"), but no filename. The "Type" column says "File", and the "Size"
column says "0 Bytes", but "ls -l" in an xterm reveals that the file has 1737515
bytes. I can open it with other applications via the xterm, but not with
OpenOffice.org.
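
For reference, the "251" in the hexdump -c output above is the octal value of
the one non-ASCII byte in the name. A small Python sketch of the arithmetic;
reading the byte as Latin-1 is an assumption about the sender's encoding:

```python
# hexdump -c prints non-printable bytes as three-digit octal numbers.
octal_token = "251"
byte_value = int(octal_token, 8)
print(hex(byte_value))  # 0xa9

raw = bytes([byte_value])

# Assumption: the sender's system used a single-byte encoding such as Latin-1.
as_latin1 = raw.decode("latin-1")
print(as_latin1)  # ©

# On its own the byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```

So the name contains a lone 0xA9 byte, which a UTF-8 locale cannot decode,
consistent with the missing filename in the "Open File" dialogue.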
Comment 29 Rainer Bielefeld 2010-06-20 17:06:17 UTC
*** Issue 112293 has been marked as a duplicate of this issue. ***
Comment 30 Rainer Bielefeld 2010-07-02 15:39:26 UTC
*** Issue 90262 has been marked as a duplicate of this issue. ***
Comment 31 mind_booster_noori 2010-07-02 16:06:45 UTC
I agree that this is *not* a request for enhancement, but a valid bug report.
You might not consider it important or urgent, but however minor you consider
it, it is still a defect and not a nice-to-have.
Comment 32 Maxim 2011-11-03 10:51:55 UTC
Any progress on this one?
It's an old (>5 years! c'mon guys!) and annoying bug in OO for MacOSX. And it's an instant showstopper if you're working with languages other than English.

Filenames are displayed in error messages correctly, so what prevents OO from actually opening the file?

Tested on 3.3.0
OOO33m20 (Build:9567)

Screenshots: 
https://skitch.com/metalim/gfshc/oo-3.3-open
https://skitch.com/metalim/gfs57/openoffice.org-3.3-filename-natinal-chars
Comment 33 Marcus 2017-05-20 10:55:16 UTC
Reset assignee to the default "issues@openoffice.apache.org".