Issue 102920

Summary: i18npool: OUStrings are really UTF-16 strings. Attached .odt continually loops due to mismatch
Product: Internationalization Reporter: caolanm
Component: i18npool Assignee: stefan.baltzer
Status: CLOSED FIXED QA Contact: issues@l10n <issues>
Severity: Trivial    
Priority: P3 CC: issues, kamataki
Version: OOo 3.1   
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: PATCH Latest Confirmation in: ---
Developer Difficulty: ---
Issue Depends on:    
Issue Blocks: 102943    
Attachments:
Description Flags
sample problematic character none
possible patch, or at least helpfully indicative none

Description caolanm 2009-06-18 17:13:41 UTC
As described in https://bugzilla.redhat.com/show_bug.cgi?id=506545 the more
uncommon Chinese characters which are encoded as two 16-bit values in our strings
are fairly commonly mistyped, which causes loops in Writer's SwScanner. The root
cause seems to be the xdictionary stuff in i18npool.

Attached is a sample document (you may have to press space or tab at the start
of the line to trigger it), and a sample patch which at least makes the crash go
away and is probably closer to being correct than the current stuff, but it's
hard to tell if more is needed, or if the patch goes in the wrong direction. But
there's definitely a problem in there anyway :-)
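
For readers unfamiliar with the mismatch, here is a minimal sketch of the
difference between stepping by 16-bit unit and stepping by code point. It is
not the i18npool code and not the attached patch: std::u16string stands in for
OUString, and the names Decoded, decodeAt and countCodePoints are made up for
illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from i18npool): the code point starting at a
// given index, plus the number of UTF-16 units it occupies.
struct Decoded {
    char32_t codePoint;
    std::size_t units; // 2 for a surrogate pair, otherwise 1
};

inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Decode at index i. A high surrogate with no low surrogate after it
// (for example at the end of the text) is returned as a single unit.
Decoded decodeAt(const std::u16string& s, std::size_t i) {
    char16_t hi = s[i];
    if (isHighSurrogate(hi) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
        char32_t cp = 0x10000
            + ((static_cast<char32_t>(hi) - 0xD800) << 10)
            + (static_cast<char32_t>(s[i + 1]) - 0xDC00);
        return {cp, 2};
    }
    return {hi, 1};
}

// Count characters by code point; the index always advances by 1 or 2,
// so the loop cannot stall on a surrogate.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); i += decodeAt(s, i).units)
        ++n;
    return n;
}
```

Code that mixes the two views, measuring a character in code points but
advancing the index in 16-bit units or the other way round, can end up
revisiting the same position, which is the kind of mismatch described above.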
Comment 1 caolanm 2009-06-18 17:14:17 UTC
Created attachment 63074 [details]
sample problematic character
Comment 2 caolanm 2009-06-18 17:14:54 UTC
Created attachment 63075 [details]
possible patch, or at least helpfully indicative
Comment 3 ooo 2009-06-19 10:05:18 UTC
Ayay.. thanks! I don't see anything wrong with this patch at a first glance.
There just may be more places where iterateCodePoints() should be used.
Comment 4 caolanm 2009-06-19 10:12:48 UTC
Yeah, what's changed here is probably good, just not sure if more should be
changed, especially around whether or not such a char should be considered
in the gendict lookup table.
Comment 5 ooo 2009-08-10 19:46:32 UTC
Reassigning to spare time account.
Comment 6 erack 2009-08-17 21:04:42 UTC
In cws locales32:

revision 275072
i18npool/inc/xdictionary.hxx
i18npool/source/breakiterator/breakiteratorImpl.cxx
i18npool/source/breakiterator/breakiterator_cjk.cxx
i18npool/source/breakiterator/xdictionary.cxx

Also adapted local iterateCodePoints() in breakiteratorImpl.cxx to cope with
surrogates at text end. Use OUString::iterateCodePoints() in
BreakIterator_CJK::getLineBreak().

I actually have no idea if and how surrogates could be handled with the
gendict dictionary.
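
As an illustration of what "cope with surrogates at text end" can mean, here
is a rough sketch of forward and backward stepping that treats an unpaired
surrogate as a single unit. It is not the code from revision 275072;
std::u16string stands in for OUString, and nextIndex and previousIndex are
hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stepping helpers, not the function changed in i18npool.

// Index of the code point following the one at i. A high surrogate that is
// the last unit of the text has no partner, so it is stepped over as one unit.
std::size_t nextIndex(const std::u16string& s, std::size_t i) {
    if (i >= s.size())
        return s.size();
    bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
    bool pair = high && i + 1 < s.size()
        && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
    return i + (pair ? 2 : 1);
}

// Index of the code point preceding position i. A low surrogate at the very
// start of the text has no partner and is stepped over as one unit.
std::size_t previousIndex(const std::u16string& s, std::size_t i) {
    if (i == 0)
        return 0;
    if (i > s.size())
        i = s.size();
    bool low = s[i - 1] >= 0xDC00 && s[i - 1] <= 0xDFFF;
    bool pair = low && i >= 2
        && s[i - 2] >= 0xD800 && s[i - 2] <= 0xDBFF;
    return i - (pair ? 2 : 1);
}
```

Both helpers check the bounds before looking at the neighbouring unit, so a
truncated pair at either end of the text is never read past.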
Comment 7 ooo 2009-09-04 16:29:46 UTC
Reassigning to QA for verification.
Comment 8 stefan.baltzer 2009-09-11 15:31:08 UTC
Verified in CWS locales32.
Comment 9 caolanm 2009-10-02 10:06:48 UTC
closed, seen m60