Issue 102920 - i18npool: OUString's are really utf16 strings. Attached .odt continually loops due to mismatch
Status: CLOSED FIXED
Alias: None
Product: Internationalization
Classification: Code
Component: i18npool
Version: OOo 3.1
Hardware: All All
Importance: P3 Trivial
Target Milestone: ---
Assignee: stefan.baltzer
QA Contact: issues@l10n
URL:
Keywords:
Depends on:
Blocks: 102943
Reported: 2009-06-18 17:13 UTC by caolanm
Modified: 2013-08-07 15:02 UTC (History)
CC List: 2 users

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
sample problematic character (7.12 KB, application/vnd.oasis.opendocument.text)
2009-06-18 17:14 UTC, caolanm
possible patch, or at least helpfully indicative (5.48 KB, patch)
2009-06-18 17:14 UTC, caolanm

Description caolanm 2009-06-18 17:13:41 UTC
As described in https://bugzilla.redhat.com/show_bug.cgi?id=506545, the more
uncommon Chinese characters, which are encoded as two 16-bit values in our
strings, are fairly commonly mishandled, which causes loops in Writer's
SwScanner. The root cause seems to be the xdictionary stuff in i18npool.

Attached is a sample document (you may have to press space or tab at the start
of the line to trigger it), and a sample patch which at least makes the crash go
away and is probably closer to being correct than the current stuff. It's
hard to tell if more is needed, or if the patch goes in the wrong direction, but
there's definitely a problem in there anyway :-)
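
For readers unfamiliar with the underlying problem, here is a minimal sketch of stepping through a UTF-16 string by code point rather than by 16-bit code unit. It uses std::u16string as a stand-in for OUString, and the helper names (isHighSurrogate, isLowSurrogate, nextCodePoint, countCodePoints) are hypothetical illustrations, not i18npool or sal API and not the attached patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only; not part of i18npool.
inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Returns the code point starting at s[idx] and advances idx past it:
// by two code units for a well-formed surrogate pair, otherwise by one.
char32_t nextCodePoint(const std::u16string& s, std::size_t& idx)
{
    assert(idx < s.size());
    char16_t hi = s[idx++];
    // Only read the second unit if it exists and really is a low surrogate.
    if (isHighSurrogate(hi) && idx < s.size() && isLowSurrogate(s[idx]))
    {
        char16_t lo = s[idx++];
        return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (char32_t(lo) - 0xDC00);
    }
    return hi; // BMP character, or an unpaired surrogate passed through as-is
}

// Counts code points, which can be fewer than s.size() code units.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n)
        nextCodePoint(s, i);
    return n;
}
```

Code that advances an index by one code unit where a position in code points is expected can end up in the middle of a surrogate pair, which is the kind of mismatch the summary of this issue refers to.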
Comment 1 caolanm 2009-06-18 17:14:17 UTC
Created attachment 63074
sample problematic character
Comment 2 caolanm 2009-06-18 17:14:54 UTC
Created attachment 63075
possible patch, or at least helpfully indicative
Comment 3 ooo 2009-06-19 10:05:18 UTC
Ayay.. thanks! I don't see anything wrong with this patch at a first glance.
There just may be more places where iterateCodePoints() should be used.
Comment 4 caolanm 2009-06-19 10:12:48 UTC
Yeah, what's changed here is probably good, just not sure if more should be
changed, especially around whether or not such a char should be considered
in the gendict lookup table.
Comment 5 ooo 2009-08-10 19:46:32 UTC
Reassigning to spare time account.
Comment 6 erack 2009-08-17 21:04:42 UTC
In cws locales32:

revision 275072
i18npool/inc/xdictionary.hxx
i18npool/source/breakiterator/breakiteratorImpl.cxx
i18npool/source/breakiterator/breakiterator_cjk.cxx
i18npool/source/breakiterator/xdictionary.cxx

Also adapted local iterateCodePoints() in breakiteratorImpl.cxx to cope with
surrogates at text end. Use OUString::iterateCodePoints() in
BreakIterator_CJK::getLineBreak()

I actually have no idea if and how surrogates could be handled with the
gendict dictionary.
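
As an illustration of what coping with surrogates at the text boundaries can involve, here is a minimal sketch of stepping backward by one code point with bounds checks. It uses std::u16string in place of OUString, and previousCodePoint is a hypothetical helper; it is not the code committed in revision 275072.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not the i18npool function.
// Moves idx back to the start of the code point that ends just before idx
// and returns that code point.
char32_t previousCodePoint(const std::u16string& s, std::size_t& idx)
{
    assert(idx > 0 && idx <= s.size());
    char16_t lo = s[--idx];
    // Look for a preceding high surrogate only if one can exist (idx > 0).
    if (lo >= 0xDC00 && lo <= 0xDFFF && idx > 0)
    {
        char16_t hi = s[idx - 1];
        if (hi >= 0xD800 && hi <= 0xDBFF)
        {
            --idx;
            return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (char32_t(lo) - 0xDC00);
        }
    }
    return lo; // BMP character, or an unpaired surrogate passed through as-is
}
```

In this sketch an unpaired surrogate at either end of the text is returned as a single unit instead of triggering a read outside the string.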
Comment 7 ooo 2009-09-04 16:29:46 UTC
Reassigning to QA for verification.
Comment 8 stefan.baltzer 2009-09-11 15:31:08 UTC
Verified in CWS locales32.
Comment 9 caolanm 2009-10-02 10:06:48 UTC
closed, seen m60