Apache OpenOffice (AOO) Bugzilla – Issue 69973
filenames with non native encoding cannot be loaded
Last modified: 2017-05-20 10:55:16 UTC
When I try to open a file that has an accented character in its name, comes from another OS, and uses a different encoding, OpenOffice.org cannot open it. One way to reproduce this is to create a file with an 'é' character in its name on a system using latin-1 or latin-9 encoding (with Word under Windows, for example), put it in a zip archive, transfer the archive to a GNU/Linux machine running Ubuntu with UTF-8 encoding, and extract the file from the archive. The 'é' character in the filename is encoded in latin-1, which corresponds to the single byte 0xe9. When OpenOffice.org attempts to open this file, I think it recognizes the letter and, before opening the file, converts it to the native encoding of the platform, i.e. the two bytes 0xc3 0xa9 in this case, which is correct in UTF-8. However, this is NOT the name of the file, so the file cannot be opened. This fails whether I give the name on the command line or try to open the file from the GUI. This seems somewhat related to issue 25416. I don't know if it is the same bug reopening or if it is a side effect of the fix.
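A minimal Python sketch of the byte-level difference described in the report. Python and the example name accentué.odt are used only for illustration; nothing here reflects how OpenOffice.org itself handles names.

```python
# 'é' is U+00E9; written as an escape so the snippet does not depend on
# the encoding of the source file itself.
name = "accentu\u00e9.odt"

latin1_bytes = name.encode("latin-1")  # 'é' becomes the single byte 0xe9
utf8_bytes = name.encode("utf-8")      # 'é' becomes the two bytes 0xc3 0xa9

print(latin1_bytes)  # b'accentu\xe9.odt'
print(utf8_bytes)    # b'accentu\xc3\xa9.odt'

# Decoding the on-disk latin-1 bytes and re-encoding them as UTF-8 produces
# a different byte string, which the kernel treats as a different file name.
reencoded = latin1_bytes.decode("latin-1").encode("utf-8")
print(reencoded == latin1_bytes)  # False
```

If an application performs this kind of re-encoding before calling open, it asks the OS for a name that does not exist on disk.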
Framework issue.
@maisonobe Please attach a .zip containing a document with a non working name
Created attachment 43029 [details] Tar archive with bad filename
I checked with "2.2.0 Dev. Snapshot WIN XP: [680m7(Build9118)]" and I checked with "2.0.2 German version WIN XP: [680m5(Build9011)]" No problem to open the documents in the tar from OOo with OOo or WIN dialogue and also from WIN explorer. LINUX specific?
Rather than Linux or Windows specific, I would say it may be cross-platform specific ;-) I was not able to open the file in the attachment provided by jakubsuchy on my UTF-8 Ubuntu box. The problem occurs when a file is transferred from one system to another, or when the name is changed to a name which is correct in a non-default encoding. I will attach another zip with an empty document file under three different names. The content of the three files is exactly the same, only the name is changed. One name is in UTF-8, another one is in Latin1 and the last one is in some MAC encoding (I think). I can open two of these files here (UTF-8 and MAC) but not the third one (Latin1). The Latin1 file appears in the file selector and I can select it, but opening fails with a message stating that the file does not exist. On other systems, which files fail and which do not may differ, I cannot say. I think the important thing is that the Latin1 file (in my case) *is* a legitimate filename on some platforms, so I think OOo tries to be too smart and some layer understands the encoding and converts it before opening the file.
Created attachment 43035 [details] zip containing an empty document with three different names
Seems to be dup of http://www.openoffice.org/issues/show_bug.cgi?id=59251. Resolving as such. *** This issue has been marked as a duplicate of 59251 ***
If this happens on Linux, it can't be a duplicate of 59251, as that issue is about a totally Windows-specific problem. But maybe several unrelated problem scenarios are mixed up here, and one of them is the 59251 one, i.e. opening documents by double-clicking in Explorer when the file name contains characters not in the system codepage.
Reopening as per tml.
Confirming with 2.2 on Suse 10.2 KDE - one of the files (circled in green on attached screenshot) from "three-names.zip" could not be opened from desktop by clicking it. File - Open would not even show the file in the list (circled in red).
Created attachment 44713 [details] Illustrating screenshot
I think the problem here is that OOo doesn't treat file names on Unix as opaque byte strings, but interprets them according to the codeset/encoding (not really sure about the correct terminology here) of the user's current locale? This is problematic as file names on Unix *are* just byte strings. Any interpretation of file name byte strings as UTF-8 or something else is up to the user-level software. It's up to the user's carefulness and cluefulness, and site policies, whether the actual file names present on a Unix file system are in some consistent codeset/encoding or not. I assume it is very common that Western European Unix installations, for example, have file names both in ISO8859-1 and UTF-8. The file names (byte strings) in the three-names.zip file are, with Perl-style hex escape syntax: 1) accentu\x{c3}\x{a9}.odt 2) accentu\x{e9}.odt 3) accentu\x{e2}\x{88}\x{9a}\x{c2}\x{a9}.odt The zip format apparently stores file names just as byte streams. According to http://www.pkware.com/documents/casestudies/APPNOTE.TXT there are mechanisms to indicate the codeset and encoding of the file names, but I don't know how well those are implemented and adhered to by the actual zip implementations. I haven't checked the code, but apparently OOo trusts that the codeset/encoding the user's locale setting indicates really is enforced, and that all file names encountered are legal in that codeset/encoding. It probably converts the filenames from this encoding to its internal UTF-16 string format. File names on disk that aren't legal in the locale's encoding are just skipped. In your SUSE case, apparently the encoding the locale indicates is in use is UTF-8. Only the file names 1) and 3) above are legal UTF-8 strings. The file name 2) presumably is in some single-byte codeset like CP1252. It is not legal UTF-8, so OOo just skips it. How does the directory listing look in a shell window, or a file browser window? Fixing this problem might be quite hard.
It might also be argued that this is a case of garbage in, garbage out. If the user can't keep track of using a consistent encoding for her file names, why should OOo care ;)
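The claim that only names 1) and 3) from three-names.zip are legal UTF-8 can be checked with a short sketch. The helper name is_valid_utf8 is hypothetical, and the byte strings are copied from the list in the comment above.

```python
# The three file names from three-names.zip, as raw byte strings.
names = [
    b"accentu\xc3\xa9.odt",              # 1)
    b"accentu\xe9.odt",                  # 2)
    b"accentu\xe2\x88\x9a\xc2\xa9.odt",  # 3)
]


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


results = [is_valid_utf8(n) for n in names]
print(results)  # [True, False, True]
```

A file picker that silently drops names for which such a check fails would hide name 2) on a UTF-8 locale, which matches the behaviour reported on SUSE.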
Created attachment 44734 [details] Screenshot of console and file browser
Screenshot is attached. I think that trying to enforce some naming rules in addition to those of the OS is wrong and will negatively affect our image: if the OS can handle the file, why shouldn't the app be able to as well? Wasn't it you who quoted "be liberal in what you accept, strict in what you generate"? :D
I agree that the encoding is inconsistent with the locale setting; this is exactly my point, in fact. However, I think it is important to handle this case transparently. The real-life situation in which I encountered the case (several times) is when I receive files created by co-workers (well, mainly my boss) on their system in order to provide some content and send it back to them. The file name may seem strange according to *my* system, but it is neither forbidden by Unix file naming rules (as long as some special characters are avoided) nor impossible to handle with other applications (mail, shell commands, browsers ...). I said in my first post that OOo tries to be too smart here. The file selection widget succeeds in handling the name, as it can (sometimes) put it in the selectable list and it allows the user to click on it. However, once this has been done and the name is passed to some other part of the code which will open it, it seems some sort of re-encoding or normalization layer is traversed and a different name is provided to the "open" system call, which fails because the re-encoded name does not correspond to any existing file. So I see a file in a widget, click on it, and OOo says "no such file or directory". This is a real problem, regardless of the fact that the name is badly chosen or inconsistent with the locale. It is a valid name for the operating system and it should not be transformed. I understand such transcoding layers are interesting for display purposes, for example to show a human-readable version of the name in the selection widget, but I don't agree with a transcoding being inserted between the selection and the opening of the file.
Changed to enhancement and sent to requirements to discuss how to handle this problem.
Thorsten, I respectfully disagree with your assessment. This _is_ a defect as we can't load a validly named file - essential functionality is broken (even if it has been broken since pre-1.0 times).
I don't understand why this problem is now considered an enhancement request. Since the badly-named file does exist on the file system and can be successfully opened by all applications on my Linux box except OOo, I still consider it a problem. It may end up as an enhancement request for some low-level transcoding layer, but at the application level at which I reported it, it is a defect. Could you explain to me the rationale behind changing the issue type from defect to enhancement?
Thorsten, please reconsider change from defect to enhancement. Thanks.
CCing tm and mba for further investigations
*** Issue 101322 has been marked as a duplicate of this issue. ***
Sorry, but how on earth can one get the idea that "requirements" could be the right owner of this issue? This clearly is a technical problem. Thanks to of for making me aware of that issue. As for the problem itself, I don't care if it is a defect or an enhancement. There are good arguments for both. It seems that (according to tml) OOo's treatment of file names is at fault in general here. So one can say that it "works as designed" but the design is wrong here. First I want to understand the problem a little bit better. I think Mikhail is the best owner of this issue.
Created attachment 63597 [details] OO canot open files on the Mac OS X
Guys, I attached a video file. My issue 101322 has been marked as a duplicate of this issue, so I think it can help you to resolve it. For me this problem is specific to Mac OS.
I've just been hit by this with OpenOffice 3.0.0 (still have to upgrade to a more current release, but the state of this issue report leads me to suspect that it won't change behaviour on file encodings). I tried to open a file from the shell, a file I copied from a FAT-formatted USB drive, that has a latin1 character in its name. I have a UTF-8 locale on Unix, the file name looks funny in the shell -- but I can work with the file just fine, except with OpenOffice, which lies to me with a straight face: "This file does not exist." People, this is a serious failure. Call it bug or enhancement, if you please, but please, please with sugar on top, consider fixing it. This is your program simply refusing to do its basic work on a file that can be accessed by every other program on the system. File names on UNIX are just byte streams. Encoding from the locale is only there for pretty-printing of these bytes. If OpenOffice starts interpreting and converting the file names according to some encoding rules, it is getting this basic fact plainly wrong and is broken by design. The fix should not be that hard: Just do your encoding magic solely for display, but open the underlying file using the raw bytes you get from the file system.
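The "decode for display only, open with the raw bytes" idea suggested above might look roughly like this in Python. The function names list_for_display and open_selected are hypothetical, and the sketch assumes a Unix-like file system that accepts arbitrary bytes in file names; it says nothing about OOo's actual code.

```python
import os


def list_for_display(directory: bytes):
    """Return (raw_name, display_name) pairs for a directory.

    Passing bytes to os.listdir makes it return the names as raw bytes.
    The display name is a lossy, human-readable rendering for the UI only.
    """
    pairs = []
    for raw in sorted(os.listdir(directory)):
        display = raw.decode("utf-8", errors="replace")
        pairs.append((raw, display))
    return pairs


def open_selected(directory: bytes, raw_name: bytes) -> bytes:
    """Read the selected file using the untouched raw name."""
    with open(os.path.join(directory, raw_name), "rb") as fh:
        return fh.read()
```

The UI would show the display name but keep the raw name alongside it, so that the selection handed to the open call is byte-for-byte what the file system returned.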
Yes, of course, the conversion of the paths is problematic here. But the main reason for the conversion seems to be the representation of the system paths as "file:" URLs, and the way the conversion is done. The current design is based on the assumption that the encoding of the system path is known, although that is not really the case here. Thus we have no real round trip during the conversion and lose the information. It is not easy to change this currently, since the office is completely based on URL handling. Working around the problem by transporting the original system path through the API would do the trick, but it would also be a very complex change. We could try to change the internal file: URL concept in a way that would allow the round trip.
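One way a file: URL could round-trip an arbitrary system path is to percent-encode the raw bytes rather than decoded characters. The sketch below uses Python's standard urllib.parse helpers purely as an illustration; the function names are hypothetical and this is not a description of OOo's URL layer.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

PREFIX = "file://"


def path_to_file_url(raw_path: bytes) -> str:
    """Percent-encode the raw path bytes without assuming any encoding."""
    return PREFIX + quote_from_bytes(raw_path)


def file_url_to_path(url: str) -> bytes:
    """Recover the exact original bytes from the URL."""
    if not url.startswith(PREFIX):
        raise ValueError("not a file URL")
    return unquote_to_bytes(url[len(PREFIX):])


raw = b"/home/u/accentu\xe9.odt"  # latin-1 name on a UTF-8 system
url = path_to_file_url(raw)
print(url)                           # file:///home/u/accentu%E9.odt
print(file_url_to_path(url) == raw)  # True
```

Because the URL carries the bytes themselves, no knowledge of the path's encoding is needed to get back to the name the file system actually stores.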
*** Issue 108790 has been marked as a duplicate of this issue. ***
I'm having the same problem with OOo 3.1.1 on Linux/amd64. Someone sent me a file from his Mac with this filename:
$ ls -l Homepage_Seite_1_�\ Robert\ Gortana\ -\ Fotolia.com.jpg |hexdump -c
0000000 - r w - - - - - - - 1 t o n
0000010 i s t a f f 1 7 3 7 5 1 5
0000020 2 0 1 0 - 0 4 - 0 8 1 7 : 3 1
0000030 H o m e p a g e _ S e i t e _
0000040 1 _ 251 R o b e r t G o r t a
0000050 n a - F o t o l i a . c o m
0000060 . j p g \n
0000065
In the "Open File" dialogue, I can see an icon for "regular file" (as opposed to "directory"), but no filename. The "Type" column says "File", and the "Size" column says "0 Bytes", but "ls -l" in an xterm reveals that the file has 1737515 bytes. I can open it with other applications via the xterm, but not with OpenOffice.org.
*** Issue 112293 has been marked as a duplicate of this issue. ***
*** Issue 90262 has been marked as a duplicate of this issue. ***
I agree that this is *not* a request for enhancement, but a valid bug report. You might not consider it important or urgent, but however minor you consider this, it is still a defect and not a nice-to-have.
Any progress on this one? It's an old (>5 years! c'mon guys!) and annoying bug in OO for MacOSX. And it's an instant showstopper if you're working with languages other than English. Filenames are displayed in error messages correctly, so what prevents OO from actually opening the file? Tested on 3.3.0 OOO33m20 (Build:9567) Screenshots: https://skitch.com/metalim/gfshc/oo-3.3-open https://skitch.com/metalim/gfs57/openoffice.org-3.3-filename-natinal-chars
Reset assignee to the default "issues@openoffice.apache.org".