Apache OpenOffice (AOO) Bugzilla – Issue 102943
Meta-issue for tracking non-base-plane Unicode problems
Last modified: 2017-05-20 10:44:19 UTC
Many parts of the OOo codebase share the common problem that only codepoints from the Unicode base plane (BMP) are considered. As a rule of thumb, any use of individual sal_Unicode values has this problem, because that type was defined as a 16-bit unsigned integer. Better alternatives (like sal_UCS4 or sal_UTF32), as suggested in http://www.openoffice.org/servlets/BrowseList?list=interface-discuss&by=thread&from=1589799, were rejected because other 32-bit types could be used instead. That "interface vs. implementation" abuse loses important semantic context though: that the value represents a Unicode codepoint. With some automation it should be possible to find code that uses individual sal_Unicode values. Looking over every one of them and fixing them (using something that doesn't lose the higher-order bits or the Unicode semantics) is a lot of work though. This tracker issue could help to determine the priority of such a high-effort task.
added first batch of dependencies
more blocking issues
hi hdu, these kinds of problems don't surprise me at all. and no matter how many of these you solve, it's all too easy to add new ones. have you thought about attacking the root of the problem and removing ::rtl::OUString::operator sal_Unicode* ? if someone really has a need to access UTF-16 code units as opposed to Unicode characters (say for serialization), make them use a getBuffer method (i think it already exists). ob. quote: "UTF-16 is the devil's work." -- Robert O'Callahan
IMHO an approach based on string iterators would work better, as it would isolate implementation details (such as an internal UTF-16 representation) from their use. E.g. for working on Unicode codepoints one would get a UTF-32 iterator; for encoding conversions (e.g. to Big5) one would get other suitable iterators. By separating the Unicode string's implementation details (UTF-16) from its interface (specialized string iterators), this could also speed up such performance-critical tasks as XML parsing. XML text is usually encoded as UTF-8, and AFAIK it currently has to be converted to UTF-16 for further processing. By keeping the input's native encoding as an implementation detail, the conversion step, which is costly (from a processing, a memory, and a spinlock-contention perspective), could be avoided altogether.
@ hdu: hmm, your iterator suggestion sounds pretty ideal; but can you maintain binary compatibility with the existing ::rtl::OUString if you change its representation? i haven't dared think about this :) or do you suggest a new string class, with (mostly) the same interface, but without the problematic methods, and with some kind of efficient conversion to/from OUString? [sorry for the double posting, seems i accidentally clicked in the wrong place]
Re "With some automation it should be possible to find code that uses individual sal_Unicodes" see <http://www.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=18462> (from 2006). Re "your iterator suggestion sounds pretty ideal" see the existing rtl::OUString::iterateCodePoints.
@ sb: yes, there is rtl::OUString::iterateCodePoints. but there is also rtl::OUString::operator sal_Unicode*, and that is unfortunately a _lot_ more popular with users of OUString. and hdu actually suggested to have a Unicode string that can internally store any encoding, but externally present only iterator-based interfaces for various encodings; hence my concern about the binary compatibility of such a contraption.
Yes, msgNo=18462 is a good start, as it identified problems in the UNO API. Finding the remaining problems (individual sal_Unicodes) is the other important task. OUString::iterateCodePoints() was a good start too, as it was the first method in the string area that didn't require its users to handle surrogate pairs themselves. The iterator approach I outlined above is IMHO better though, because it could allow zero-conversion and zero-copy access to raw data such as performance-critical XML input. The current approach, converting it first to UTF-16 and then using iterateCodePoints() to get UTF-32, does not have that benefit.
I forgot to mention the benefit that specialized iterators could also provide such nice things as transliteration, Unicode decomposition, pre-composition, digit conversion, etc. in an orthogonal way.
CC myself
Added myself to CC.
Reset the assignee to the default "issues@openoffice.apache.org".