Apache OpenOffice (AOO) Bugzilla – Issue 31168
Data Loss when saving Unicode File to OO File Format
Last modified: 2013-08-07 14:43:31 UTC
Summary: Unicode files with code points U+1XXXX or greater cannot be saved in OpenOffice format. The characters are missing when the *.sxw file is reopened.

Setup: OS: Win2000 SP4 w/ registry modifications to support Unicode code points U+1XXXX and greater (see http://www.i18nguy.com/surrogates.html). Also installed GB18030 support package (http://www.microsoft.com/globaldev/DrIntl/columns/015/default.mspx#Q6) and the free CODE2001 font (http://home.att.net/~jameskass/code2001.htm).

Test Case:
Step 1: Copied 4 characters U+1D790 U+1D791 U+1D792 U+1D793 into OpenOffice using the BabelMap application and doing a copy and paste (OO has no way to enter code points above U+FFFF). (Characters are invisible at this point since my default font does not have the glyphs for these characters.)
Step 2: Select "Edit:Select All".
Step 3: Set font to Code2001 (characters are now visible).
Step 4: Save as "Text Encoded" UTF8. (This creates a file with a BOM and the 4 characters.)
Step 5: Close file and reopen. Select "Edit:Select All" and change font to Code2001 (OO locks up if Code2001 is used as the default font). Characters are visible, so this proves OO can successfully save as a plain UTF8 file. Also verified file correctness with a hex editor.
Step 6: Now save as "OpenOffice.org 1.0 Text Document (.sxw)". (Again OO locks up for me; this might be a Code2001 font problem.)
Step 7: Reopen the new sxw file. The characters are missing!!

Note: The lockup problem does not seem to occur when I use the commercial SimSun (Founder Extended) font, but the missing character problem still occurs. These fonts both have glyphs at code points greater than U+FFFF. See my message in the users mail group "Unicode Plane 2 Questions".
Created attachment 16293 [details] UTF8 Encoded File with BOM and 4 characters
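For anyone without access to the attachment, an equivalent test file can probably be regenerated with a short script. This is only a minimal sketch in Python, not the tool the reporter used (the report used BabelMap and OO's own "Text Encoded" export); the function names and the file name `plane1_test.txt` are hypothetical.

```python
import codecs

# The four test characters named in the report (all above U+FFFF).
CODE_POINTS = [0x1D790, 0x1D791, 0x1D792, 0x1D793]


def build_test_bytes(code_points=CODE_POINTS):
    """Return UTF-8 bytes: a BOM followed by the given characters."""
    text = "".join(chr(cp) for cp in code_points)
    return codecs.BOM_UTF8 + text.encode("utf-8")


def write_test_file(path="plane1_test.txt"):  # hypothetical file name
    """Write the BOM-prefixed test data to disk in binary mode."""
    with open(path, "wb") as f:
        f.write(build_test_bytes())


if __name__ == "__main__":
    # Hex dump for comparison against a hex editor view of the file.
    print(build_test_bytes().hex())
```

Each of the four characters takes four bytes in UTF-8, so the file should be 19 bytes long including the 3-byte BOM.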
Reassigned to ES.
It appears that the four characters >= U+10000 are completely missing from the content.xml stream of the .sxw document generated in step 6.
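A check like this can be scripted, since an .sxw document is a zip package whose text lives in the `content.xml` stream. The following is a rough sketch in Python, not the method actually used for the observation above; the function name `supplementary_chars_in_sxw` is hypothetical. Parsing the XML (rather than scanning raw bytes) means characters written as numeric character references are counted too.

```python
import zipfile
import xml.etree.ElementTree as ET


def supplementary_chars_in_sxw(sxw_file, stream_name="content.xml"):
    """Return the code points >= U+10000 found in the text of one package stream.

    sxw_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(sxw_file) as package:
        raw_xml = package.read(stream_name)
    root = ET.fromstring(raw_xml)
    # itertext() yields all character data in document order,
    # with character references already resolved by the parser.
    text = "".join(root.itertext())
    return [ord(ch) for ch in text if ord(ch) >= 0x10000]
```

For the document from step 6, an empty list would match the data loss described here, while a correctly saved file should yield the four code points from the test case.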
Update: I tried opening the test case UTF8 file on Debian Linux kernel 2.4.22 with KDE desktop. With the Linux version of OpenOffice 1.1.2 I only see square symbols. The proper glyphs are not being drawn.
Fixed saving of Unicode >= 0x010000 in XML. (Loading already worked.) Fix is in CWS swqcore02; should make it into milestone SRC680 m65 or so. dvo->reinerg61: Please test once this is available in a public build. Your comment on display should go into a different issue (if it persists), because load/save and the display code are rather different things.
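For readers unfamiliar with the underlying problem: a character >= 0x010000 occupies two 16-bit code units (a surrogate pair) in UTF-16, and an exporter has to combine the pair into one code point before writing UTF-8. The sketch below illustrates that general technique in Python, assuming the text is held as UTF-16 code units. It is not the actual patch from CWS swqcore02, and both function names are hypothetical.

```python
def utf16_units_to_code_points(units):
    """Combine UTF-16 surrogate pairs into single code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            # Standard UTF-16 decoding: 10 bits from each half, offset by 0x10000.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character, or an unpaired surrogate passed through unchanged.
            code_points.append(unit)
            i += 1
    return code_points


def units_to_utf8(units):
    """Encode UTF-16 code units as UTF-8 (raises on unpaired surrogates)."""
    return "".join(chr(cp) for cp in utf16_units_to_code_points(units)).encode("utf-8")
```

As a worked case, U+1D790 from the test file corresponds to the surrogate pair 0xD835 0xDF90, which should come out as the four UTF-8 bytes F0 9D 9E 90.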
reopen for QA
dvo->es: Please test.
Fixed
reopen
Fixed but failed. Now the text does not even display as boxes; the document shows as empty, with no text in it.
Strange. Works for me in swqcore02 build.
dvo->es: Visibility of the symbols may be a font problem. The bug is hopefully fixed anyway. :-)
set fixed
Verified in cws_swqcore02
ES->DVO: As seen and discussed, the fix is not complete in the master. The characters are there but invisible (not painted). Maybe a conflict with a recent VCL CWS?
.
I repeated the test on build 680_m69 on a Windows XP machine. The characters no longer display, although they do display on this machine using version 1.1.4. I tried Uniscribe (usp10.dll) versions 1.420.2600.2180 and 1.471.4030.0. Uniscribe (usp10.dll) is used to render glyphs on a Windows machine.
dvo->reinerg61: Thanks for the report. I am seeing the same issue here.

dvo: I'm in a bit of a fix here, because: 1) the load/save part now works, 2) Unicode surrogate support seems broken elsewhere, particularly the display, and 3) surrogate support isn't even officially in the product. What I want to do is this: I will consider this issue to be for load/save in XML only. As such, it's a developer bug, fixed, and can now be closed. Dealing with the surrogate display issues is another issue since it touches completely different code. (And it doesn't match this issue's description either.)

dvo: I pronounce this issue fixed. 'Fixed', in that any surrogate characters should be loaded and saved correctly from/into the Writer XML formats (*.sxw, *.odt). The fix has been integrated into milestone m70.

dvo->hdu: Issue #i40391# is a follow-on to this issue that deals with the display of surrogate characters. I assign it to you. Examples from this issue can be used to reproduce the problem.

dvo->reinerg61: Load/save should work in m70 (or later). The display problem will be tracked using issue 40391.
going to close ancient issues
closing ancient issues