Apache OpenOffice (AOO) Bugzilla – Issue 102943
Meta-issue for tracking non-base-plane Unicode problems
Last modified: 2017-05-20 10:44:19 UTC
Many parts of the OOo codebase share the common problem that only codepoints from the Unicode base plane (BMP) are considered. As a rule of thumb, any use of individual sal_Unicode values has this problem, because that type was defined as a 16-bit unsigned integer. Better alternatives (like sal_UCS4 or sal_UTF32), as suggested in http://www.openoffice.org/servlets/BrowseList?list=interface-discuss&by=thread&from=1589799, were rejected because other 32-bit types could be used instead. That "interface vs. implementation" abuse loses important semantic context though: that the value represents a Unicode codepoint. With some automation it should be possible to find code that uses individual sal_Unicode values. Looking over every one of them and fixing them (using something that doesn't lose the higher-order bits or the Unicode semantics) is a lot of work though. This tracker issue could help to determine the priority of such a high-effort task.
added first batch of dependencies
more blocking issues
hi hdu, these kinds of problems don't surprise me at all. and no matter how many of these you solve, it's all too easy to add new ones. have you thought about attacking the root of the problem and removing ::rtl::OUString::operator sal_Unicode* ? if someone really has a need to access UTF-16 code units as opposed to Unicode characters (say for serialization), make them use a getBuffer method (i think it already exists). ob. quote: "UTF-16 is the devil's work." -- Robert O'Callahan
IMHO an approach based on string iterators would work better, as it would isolate implementation details (such as an internal UTF-16 representation) from their use. E.g. for working on Unicode codepoints one would get a UTF-32 iterator; for encoding conversions (e.g. to Big5) one would get other suitable iterators. By separating the Unicode string's implementation details (UTF-16) from its interface (specialized string iterators), this could also speed up such performance-critical tasks as XML parsing. XML text is usually encoded as UTF-8, and AFAIK it currently has to be converted to UTF-16 for further processing. By keeping the input's native encoding as an implementation detail, the conversion step, which is costly (from a processing, a memory, and a spinlock-contention perspective), could be avoided altogether.
@ hdu: hmm, your iterator suggestion sounds pretty ideal; but can you maintain binary compatibility with the existing ::rtl::OUString if you change its representation? i haven't dared think about this :) or do you suggest a new string class, with (mostly) the same interface, but without the problematic methods, and with some kind of efficient conversion to/from OUString? [sorry for the double posting, seems i accidentally clicked in the wrong place]
Re "With some automation it should be possible to find code that uses individual sal_Unicodes" see <http://www.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=18462> (from 2006). Re "your iterator suggestion sounds pretty ideal" see the existing rtl::OUString::iterateCodePoints.
@ sb: yes, there is rtl::OUString::iterateCodePoints. but there is also rtl::OUString::operator sal_Unicode*, and that is unfortunately a _lot_ more popular with users of OUString. and hdu actually suggested to have a Unicode string that can internally store any encoding, but externally present only iterator-based interfaces for various encodings; hence my concern about the binary compatibility of such a contraption.
Yes, msgNo=18462 is a good start, as it identified problems in the UNO API. Finding the remaining problems (individual sal_Unicodes) is the other important task. OUString::iterateCodePoints() was a good start too, as it was the first method in the string area that didn't require its users to handle surrogate pairs themselves. The iterator approach I outlined above is IMHO better though, because it could allow zero-conversion and zero-copy access to raw data such as performance-critical XML input. The current approach, converting it first to UTF-16 and then using iterateCodePoints() to get UTF-32, does not have that benefit.
I forgot to mention the benefit that specialized iterators could also provide such nice things as transliteration, Unicode decomposition, pre-composition, digit conversion, etc. in an orthogonal way.
CC myself
Added myself to CC.
Reset the assignee to the default "issues@openoffice.apache.org".