Issue 41792 - Incorrect Handling of Surrogate Pair
Summary: Incorrect Handling of Surrogate Pair
Status: CONFIRMED
Alias: None
Product: Writer
Classification: Application
Component: code
Version: OOo 1.1.2
Hardware: PC Linux, all
Importance: P3 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: oooqa
Depends on: 40391 45983 105571 105901
Blocks: 102943
 
Reported: 2005-02-01 11:46 UTC by xieqian
Modified: 2013-08-07 14:38 UTC (History)

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
sample document using the non-BMP codepoints for the Phaistos disc (96.82 KB, application/vnd.oasis.opendocument.text)
2010-08-10 14:48 UTC, hdu@apache.org

Description xieqian 2005-02-01 11:46:00 UTC
This bug is related to issue 40391. When testing non-BMP character support in 
OpenOffice.org 1.1.2 for Linux, as shipped with Fedora Core 3, I found problems 
with surrogate pair handling. Besides the display issue described in 
issue 40391, the internal processing is also problematic. The details are 
described below.

1. Test method: The test text is <U00004E86> <U0002010F> <U00004E8C> 
<U0002011F>. All four are Chinese characters, and two of them are SIP 
(Supplementary Ideographic Plane) characters. The sample text was stored in a 
plain text file. The file was opened with oowriter, and normal editing 
operations such as selecting, deleting, and copy-pasting were performed.
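
For anyone who wants to recreate the test file, the following is a minimal sketch that builds the four-character sample and shows both its UTF-8 bytes and its UTF-16 code units. The file name surrogate_test.txt and the helper names are placeholders chosen for illustration; they are not taken from the original test setup.

```python
# Sketch: build the four-character sample text from the report.
# File name and helper names are illustrative assumptions.
TEST_CODEPOINTS = [0x4E86, 0x2010F, 0x4E8C, 0x2011F]


def to_utf16_units(codepoints):
    """Return the UTF-16 code units for a sequence of code points."""
    units = []
    for cp in codepoints:
        if cp >= 0x10000:
            v = cp - 0x10000
            units.append(0xD800 + (v >> 10))    # high (leading) surrogate
            units.append(0xDC00 + (v & 0x3FF))  # low (trailing) surrogate
        else:
            units.append(cp)
    return units


text = "".join(chr(cp) for cp in TEST_CODEPOINTS)
utf8 = text.encode("utf-8")
units = to_utf16_units(TEST_CODEPOINTS)

print(utf8.hex(" ").upper())
# E4 BA 86 F0 A0 84 8F E4 BA 8C F0 A0 84 9F
print(" ".join(f"{u:04X}" for u in units))
# 4E86 D840 DD0F 4E8C D840 DD1F

if __name__ == "__main__":
    # Write the sample as a plain UTF-8 text file that can be opened in Writer.
    with open("surrogate_test.txt", "wb") as f:
        f.write(utf8)
```

Each SIP character occupies two 16-bit units (D840 DD0F and D840 DD1F), which is why it can end up being treated as two characters.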

2. Phenomena: The SIP characters cannot be displayed. A blank space two 
characters wide is kept for each SIP character, and it can actually be 
operated on as two characters. Although the caret cannot be moved into a SIP 
character with the arrow keys, part of the character can be selected with the 
mouse. Consequently, a surrogate pair may lose its integrity in delete or 
copy-paste operations. This can be observed by monitoring the data passed 
during selection transfer. The target type is UTF8_STRING.
  Selecting all gives E4 BA 86 F0 A0 84 8F E4 BA 8C F0 A0 84 9F.
  Selecting the first half of <U0002010F> gives nothing.
  Selecting the second half of <U0002010F> gives 3F, the code value of '?'.
  Deleting the first half of <U0002010F>, then selecting all, gives E4 BA 86 3F 
E4 BA 8C F0 A0 84 9F.
  Deleting the second half of <U0002010F>, then selecting all, gives E4 BA 86 3F 
F0 A0 84 9F, indicating that the remaining half of the surrogate pair is 
combined with the following normal character.
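
The five observations above are consistent with a UTF-16 to UTF-8 conversion that, on seeing a high surrogate, always consumes the next 16-bit unit without checking that it is a low surrogate. The sketch below is only a hypothetical model that reproduces the observed bytes; it is not OOo code, and the function name lenient_utf16_to_utf8 is made up for this illustration.

```python
# Hypothetical model of the observed behaviour, NOT the actual OOo converter.
def lenient_utf16_to_utf8(units):
    """Convert UTF-16 code units to UTF-8 the way the observations suggest."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                # high surrogate
            if i + 1 == len(units):
                i += 1                           # lone high at the end: nothing emitted
                continue
            nxt = units[i + 1]
            if 0xDC00 <= nxt <= 0xDFFF:          # valid pair
                cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
                out += chr(cp).encode("utf-8")
            else:
                out += b"?"                      # assumed flaw: next unit is swallowed
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:              # lone low surrogate
            out += b"?"
            i += 1
        else:                                    # ordinary BMP character
            out += chr(u).encode("utf-8")
            i += 1
    return bytes(out)


FULL = [0x4E86, 0xD840, 0xDD0F, 0x4E8C, 0xD840, 0xDD1F]
cases = {
    "select all": FULL,
    "first half only": [0xD840],
    "second half only": [0xDD0F],
    "first half deleted": FULL[:1] + FULL[2:],
    "second half deleted": FULL[:2] + FULL[3:],
}
for name, units in cases.items():
    print(f"{name}: {lenient_utf16_to_utf8(units).hex(' ').upper()}")
# select all: E4 BA 86 F0 A0 84 8F E4 BA 8C F0 A0 84 9F
# first half only:
# second half only: 3F
# first half deleted: E4 BA 86 3F E4 BA 8C F0 A0 84 9F
# second half deleted: E4 BA 86 3F F0 A0 84 9F
```

In the last case the bytes for <U00004E8C> are missing because the orphaned high surrogate and the following character are replaced together by a single '?'.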

3. Conclusion: There is some protection to preserve the integrity of surrogate 
pairs during editing operations, but it is far from sufficient. Recognition of 
invalid (unpaired) surrogates should be improved. More importantly, any 
operation that can damage the integrity of a surrogate pair must be eliminated 
entirely, which seems to require a profound change to some of OOo's 
fundamental facilities.

Although only one particular version of OOo was tested, I believe the problem 
exists in all versions. I am not an OOo developer and am not sure to whom 
this issue should be assigned, but I hope that filing this bug report helps.
Comment 1 christof.pintaske 2005-02-01 12:43:26 UTC
looks like a writer issue (the issue with displaying the characters is handled
in i40391)
Comment 2 ulf.stroehler 2005-02-02 14:53:25 UTC
us->dvo: not sure whether this issue is already covered by issue 40391. Also I
don't know if it's gsl or Writer. Could you pls. dispatch if not you and set an
appropriate target. Thanks, Ulf.
Comment 3 thackert 2005-06-12 17:04:40 UTC
I have seen that this issue is OOo 1.1.2 related (which is rather old) and that the last entry was in 
February ... :(
Have you tried this with OOo 1.1.3, 1.1.4 or some 1.9.xxx builds? Does your problem occur there, 
too?
Comment 4 xieqian 2005-06-13 04:33:53 UTC
No, I didn't try these newer versions, because they are not at hand. The 
cause of the bug is quite deep. I believe the phenomena will persist as 
long as OOo uses UTF-16 as its internal encoding while some upper-layer 
components calculate string length by counting 16-bit units rather than using 
a unified API. 
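
To illustrate the difference between counting 16-bit units and counting characters, here is a minimal sketch. The helper names are hypothetical and only show the idea of a single shared routine; they do not correspond to any existing OOo API.

```python
# Sketch: length in UTF-16 code units versus length in code points.
# All names here are illustrative assumptions.
def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF


def count_code_points(units):
    """Count characters, treating each valid surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units):
        if (is_high_surrogate(units[i])
                and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            i += 2   # a valid pair is one character
        else:
            i += 1   # BMP character or unpaired surrogate
        count += 1
    return count


units = [0x4E86, 0xD840, 0xDD0F, 0x4E8C, 0xD840, 0xDD1F]
print(len(units))                # 6 code units
print(count_code_points(units))  # 4 characters
```

Code that uses the first number as the character count will treat each SIP character as two independently editable positions.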
Comment 5 andreas.martens 2005-06-28 16:26:38 UTC
Redistributing dvo's issues.
Comment 6 lohmaier 2005-08-23 23:22:04 UTC
For me, it is covered by issue 45983 and issue 40391 (feel free to mark it
as a duplicate).

But anyway: This issue has already been confirmed by different people but is
still marked unconfirmed → confirming.
To show the relation to the other two surrogate issues (win & linux) I set those
in the depends-on field.
Comment 7 tino.rachui 2005-10-10 09:37:08 UTC
Retargeted to OOo later
Comment 8 tino.rachui 2006-11-17 08:06:10 UTC
tra->mba: You might want to distribute this one inside your team
Comment 9 Mathias_Bauer 2006-12-08 11:36:12 UTC
Stephan, can you help me distribute this issue to the right developer?
Comment 10 Stephan Bergmann 2006-12-11 13:23:30 UTC
@ama:  As discussed with mba:  The code in Writer that determines what has been
selected appears to not take care of surrogate pairs.
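
One possible way to make selection code pair-aware is to snap the selection boundaries so that they never fall between the two halves of a surrogate pair. The following is a minimal sketch under that assumption; snap_selection is a hypothetical helper and does not reflect how Writer actually computes selections.

```python
# Sketch: keep a selection [start, end) from splitting a surrogate pair.
# Indices are UTF-16 code unit offsets; all names are illustrative.
def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF


def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF


def snap_selection(units, start, end):
    """Return (start, end) adjusted so that no valid pair is cut in half."""
    def splits_pair(i):
        return (0 < i < len(units)
                and is_high_surrogate(units[i - 1])
                and is_low_surrogate(units[i]))

    if start == end:
        # Empty selection (caret): move it in front of the pair.
        pos = start - 1 if splits_pair(start) else start
        return pos, pos
    if splits_pair(start):
        start -= 1   # include the high surrogate
    if splits_pair(end):
        end += 1     # include the low surrogate
    return start, end


units = [0x4E86, 0xD840, 0xDD0F, 0x4E8C, 0xD840, 0xDD1F]
print(snap_selection(units, 1, 2))  # first half selected  -> (1, 3)
print(snap_selection(units, 2, 3))  # second half selected -> (1, 3)
print(snap_selection(units, 0, 1))  # ordinary character   -> (0, 1)
```

Expanding the selection to cover the whole pair is a design choice made for this sketch. Shrinking it to exclude the pair would be an equally valid policy.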
Comment 11 hdu@apache.org 2010-08-10 11:51:22 UTC
This issue has probably been fixed indirectly via issue 105901 and issue 105571.
If testing confirms that all is fine now, I'd close it as a duplicate of 105901.
Comment 12 hdu@apache.org 2010-08-10 12:09:11 UTC
Looking at issue 105571's desc15 the problem here can still occur because the fix I suggested in desc14 
was not fully implemented there but only for scripts classified as CTL. Examples to reproduce it should use 
e.g. codepoints from the math symbols in U+1D400..U+1D7FF.
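
A short reproduction text from that range could be generated as sketched below. The function name sample_math_text and the choice to take the first few assigned code points are assumptions made for illustration only.

```python
# Sketch: build a test string from the math alphanumeric range
# U+1D400..U+1D7FF. Names and the sampling choice are illustrative.
import unicodedata

FIRST, LAST = 0x1D400, 0x1D7FF


def sample_math_text(count=8):
    """Return the first `count` assigned characters in the range."""
    chars = []
    for cp in range(FIRST, LAST + 1):
        ch = chr(cp)
        if unicodedata.name(ch, ""):   # skip unassigned code points
            chars.append(ch)
        if len(chars) == count:
            break
    return "".join(chars)


sample = sample_math_text(4)
# Every character in this range needs a surrogate pair in UTF-16.
print(len(sample), len(sample.encode("utf-16-le")) // 2)
```

The string can be pasted into a document to check whether selecting, deleting, and copying keep each surrogate pair intact.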
Comment 13 hdu@apache.org 2010-08-10 14:48:21 UTC
Created attachment 71027
sample document using the non-BMP codepoints for the Phaistos disc