Apache OpenOffice (AOO) Bugzilla – Issue 41792
Incorrect Handling of Surrogate Pair
Last modified: 2013-08-07 14:38:26 UTC
This bug is related to issue 40391. While testing non-BMP character support in OpenOffice.org 1.1.2 for Linux, as shipped with Fedora Core 3, I found problems with surrogate pair handling. Besides the display issue mentioned in issue 40391, the internal processing is also problematic. A description follows.

1. Test method: The test text is <U00004E86> <U0002010F> <U00004E8C> <U0002011F>. All four are Chinese characters, and two of them are SIP (Supplementary Ideographic Plane) characters. The sample text was stored in a plain text file. The file was opened with oowriter, and normal editing operations such as selecting, deleting, and copy-pasting were performed.

2. Phenomena: The SIP characters cannot be displayed. A blank space is kept for each SIP character; it occupies the width of two characters and can actually be operated on as two characters. Although one cannot move the caret into a SIP character using the arrow keys, one can select part of the character with the mouse. Consequently, a surrogate pair may lose its integrity during delete or copy-paste operations. This can be observed by monitoring the data sent when the selection is transferred. The target type is UTF8_STRING.

When selecting all, got E4 BA 86 F0 A0 84 8F E4 BA 8C F0 A0 84 9F.
When selecting the first half of <U0002010F>, got nothing.
When selecting the second half of <U0002010F>, got 3F, the code value of '?'.
When deleting the first half of <U0002010F>, then selecting all, got E4 BA 86 3F E4 BA 8C F0 A0 84 9F.
When deleting the second half of <U0002010F>, then selecting all, got E4 BA 86 3F F0 A0 84 9F, indicating that the remaining partial surrogate is combined with the following normal character.

3. Conclusion: There is some protection that preserves the integrity of surrogate pairs during editing operations, but it is far from enough. Recognition of invalid surrogate characters should be improved.

More importantly, any operation that may damage the integrity of a surrogate pair must be eliminated entirely, which seems to require a profound rework of some of OOo's fundamental facilities. Although only one particular version of OOo was tested, I believe the problem exists in all versions. I am not an OOo developer, and I am not sure to whom this issue should be assigned, but I think filing this bug report may be of some help.
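For readers who want to check the byte sequences quoted in the report, here is a minimal Python sketch. It only illustrates how the test text maps to UTF-16 code units and UTF-8 bytes; the helper names `utf16_units` and `hex_bytes` are made up for this example, and the '?' substitution shown is Python's own behavior for an unpaired surrogate, not a trace of what OOo does internally.

```python
# Test text from the report: two BMP characters and two SIP characters.
text = "\u4E86\U0002010F\u4E8C\U0002011F"


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    raw = s.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def hex_bytes(data):
    """Format bytes as space-separated upper-case hex, as in the report."""
    return " ".join(f"{b:02X}" for b in data)


units = utf16_units(text)
print([f"{u:04X}" for u in units])
# Each SIP character occupies two code units (a high and a low surrogate).

print(hex_bytes(text.encode("utf-8")))
# The intact text encodes to the 14 bytes seen when selecting all.

# An unpaired low surrogate cannot be encoded as UTF-8. With the "replace"
# error handler, Python substitutes '?' (0x3F) for it.
lone_low = chr(units[2])
print(hex_bytes(lone_low.encode("utf-8", errors="replace")))
```

The two SIP characters share the same high surrogate and differ only in the low surrogate, so each half of a pair is meaningless on its own.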
looks like a writer issue (the issue with displaying the characters is handled in i40391)
us->dvo: not sure whether this issue is already covered by issue 40391. Also I don't know if it's gsl or Writer. Could you pls. dispatch if not you and set an appropriate target. Thanks, Ulf.
I have seen that this issue is OOo 1.1.2 related (which is rather old) and that the last entry was in February ... :( Have you tried this with OOo 1.1.3, 1.1.4 or some 1.9.xxx builds? Does your problem occur there, too?
No, I didn't try these newer versions, because I don't have them at hand. The cause of the bugs is quite deep-rooted: I believe the phenomena will persist as long as OOo uses UTF-16 as its internal encoding while some upper-layer components calculate string length by counting 16-bit units rather than using a unified API.
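The difference between counting 16-bit units and counting characters can be illustrated with a short Python sketch. This is only an illustration of the general idea, not OOo code; the function name `count_code_points` and the hard-coded unit list are assumptions made for the example.

```python
def count_code_points(units):
    """Count code points in a list of UTF-16 code units.

    A high surrogate followed by a low surrogate counts as one code point.
    An unpaired surrogate is counted as one (invalid) code point.
    """
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


# UTF-16 code units of the test text U+4E86 U+2010F U+4E8C U+2011F.
units = [0x4E86, 0xD840, 0xDD0F, 0x4E8C, 0xD840, 0xDD1F]

print(len(units))                # number of 16-bit units
print(count_code_points(units))  # number of characters
```

Code that treats `len(units)` as the character count sees six positions in a four-character text, which is consistent with each SIP character behaving like two characters in the editor.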
Redistributing dvo's issues.
To me, it is covered by issue 45983 and issue 40391 (feel free to mark it as a duplicate). But anyway: this issue has already been confirmed by different people yet is still marked unconfirmed → confirming. To show the relation to the other two surrogate issues (win & linux), I set those in the depends-on field.
Retargeted to OOo later
tra->mba: You might want to distribute this one inside your team
Stephan, can you help me distribute this issue to the right developer?
@ama: As discussed with mba: The code in Writer that determines what has been selected appears to not take care of surrogate pairs.
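One possible way for selection code to take care of surrogate pairs is to widen any selection boundary that falls inside a pair. The Python sketch below is only a rough illustration of that idea under the assumption that a selection is a half-open range of UTF-16 unit indices; it is not the Writer implementation, and the names `splits_pair` and `snap_selection` are hypothetical.

```python
def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def splits_pair(units, index):
    """True if index lies between the two halves of a valid surrogate pair."""
    return (
        0 < index < len(units)
        and is_high(units[index - 1])
        and is_low(units[index])
    )


def snap_selection(units, start, end):
    """Widen the half-open range [start, end) so it never splits a pair."""
    if splits_pair(units, start):
        start -= 1
    if splits_pair(units, end):
        end += 1
    return start, end


# UTF-16 code units of the test text U+4E86 U+2010F U+4E8C U+2011F.
units = [0x4E86, 0xD840, 0xDD0F, 0x4E8C, 0xD840, 0xDD1F]

print(snap_selection(units, 1, 2))  # first half of U+2010F selected
print(snap_selection(units, 2, 3))  # second half of U+2010F selected
print(snap_selection(units, 0, 6))  # everything selected
```

With this rule, selecting either half of U+2010F yields the range covering the whole character, so a later delete or copy would never see a lone surrogate.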
This issue has probably been fixed indirectly via issue 105901 and issue 105571. If testing confirms that all is fine now, I'd close it as a duplicate of 105901.
Looking at issue 105571's desc15 the problem here can still occur because the fix I suggested in desc14 was not fully implemented there but only for scripts classified as CTL. Examples to reproduce it should use e.g. codepoints from the math symbols in U+1D400..U+1D7FF.
Created attachment 71027: sample document using the non-BMP codepoints for the Phaistos disc