Issue 78127

Summary: Inaccurate values in "Word Count" when a document includes surrogate-pair characters
Product: Writer
Reporter: kangjch <kangjingchuan>
Component: code
Assignee: AOO issues mailing list <issues>
Status: CONFIRMED ---
QA Contact:
Severity: Trivial
Priority: P3
CC: cfm.huber, chengxiuzhi, issues, jian.li, karl.hong, liujiaxiang, lohmaier, ooo, peter.junge, stephan.bergmann.secondary
Version: OOo 2.2.1 RC2
Keywords: oooqa
Target Milestone: ---
Hardware: PC
OS: All
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---

Description kangjch 2007-06-06 03:32:26 UTC
Steps to reproduce:
   1. Create a new document in Writer.
   2. Choose Insert -> Special Character. In the Special Characters dialog,
      select the font "宋体-方正超大字符集",
      select the character U+20000 (Unicode code point) with the mouse,
      then press the "OK" button.
   3. Open Tools -> Word Count. The Word Count dialog shows:

        Whole documents ----------------
            Words:        0
            Characters:   2 

Desired Results: 

        Whole documents ----------------
            Words:        1
            Characters:   1
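
For reference, the reported count of 2 matches the number of UTF-16 code units needed for U+20000. The following is a minimal sketch of the standard UTF-16 encoding rule, not code from OOo; the helper name encodeUtf16 is hypothetical and chosen only for illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the OOo code base): encode one
// Unicode code point as UTF-16 code units.
std::u16string encodeUtf16(std::uint32_t codePoint)
{
    std::u16string out;
    if (codePoint < 0x10000) {
        // Basic Multilingual Plane: a single code unit.
        out.push_back(static_cast<char16_t>(codePoint));
    } else {
        // Supplementary planes: split into a high and a low surrogate.
        std::uint32_t v = codePoint - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

Under this rule U+20000 becomes the surrogate pair 0xD840 0xDC00, so a count based on code units yields 2 for a single character.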
Comment 1 michael.ruess 2007-06-06 09:21:41 UTC
Reassigned to SBA.
Comment 2 Stephan Bergmann 2007-06-21 09:35:49 UTC
@er, khong: kangjch would like to investigate this issue, but is unsure
where to look. Does anybody know whether this word counting is done with ICU
functionality?
Comment 3 ooo 2007-06-21 13:34:27 UTC
The word count itself is Writer functionality; it should use the i18n break
iterator. If so, the implementation is under i18npool/source/breakiterator/, and
for Chinese specifically in i18npool/source/breakiterator/breakiterator_cjk.cxx,
which uses a dictionary approach to determine words, see
i18npool/source/breakiterator/data/zh.dic
Comment 4 Mathias_Bauer 2007-12-04 12:37:03 UTC
following release status meeting -> target 3.x
Comment 5 stefan.baltzer 2008-10-29 14:17:31 UTC
SBA: This issue has a target set but is still in the "Unconfirmed" state.
Please re-check with OOo 3.0 or later whether it is (still) valid.
Then confirm it or set an appropriate resolution.
Thank you.
Comment 6 stefan.baltzer 2008-10-29 14:24:37 UTC
SBA: This issue has a target set but is still in the "Unconfirmed" state.
Please re-check with OOo 3.0 or later whether it is (still) valid.
Then confirm it or set an appropriate resolution.
Thank you.
Comment 7 lohmaier 2009-09-28 23:09:34 UTC
confirming with OOo 3.1.1

Inserting U+20000 into an empty document results in:

WordCount: 1       (as expected, improvement compared to the initial report)
CharacterCount: 2  (should be 1)

(Instead of U+20000 one could use U+20027, which has a glyph representation in
the code2001 font.)

OOo's cursor movement treats the characters as one single character, i.e. one
keypress of left/right is enough to go past the character.

PS: easiest to reproduce with gtk's unicode input method: <ctrl>+<shift>+u, then
the character code, then <enter> (or keep ctrl+shift+u pressed while entering the
code, then release)
Comment 8 karl.hong 2009-09-29 01:28:58 UTC
Word count is handled by the i18n word break iterator. Chinese surrogate characters cannot currently be
processed by OOo's dictionary-based Chinese word break iterator, so they fall back to the ICU break
iterator, which should count one word per character. As tested by cloph, it counts the word correctly.

Character count is handled by Writer itself. I don't think it calls the character break iterator, which can find
the correct character boundary for a surrogate pair.
Comment 9 ooo 2009-09-29 10:52:08 UTC
To whom it may concern: to iterate over and count characters in the internal
UTF-16 encoding use OUString::iterateCodePoints().
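
To illustrate the idea, here is a minimal standalone sketch of counting code points rather than UTF-16 code units. It uses std::u16string as a stand-in for OUString so that it compiles on its own; the function name countCodePoints is hypothetical, and this is not the Writer implementation. Inside OOo the same loop would presumably be driven by OUString::iterateCodePoints() instead of the manual surrogate check.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one character;
// an unpaired surrogate is counted as one character on its own.
std::size_t countCodePoints(const std::u16string& text)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t unit = text[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool nextIsLow = i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF;
        if (isHigh && nextIsLow) {
            ++i; // skip the low surrogate of this pair
        }
        ++count;
    }
    return count;
}
```

With this approach a document containing only U+20000 would report one character, while the length in code units stays 2.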
Comment 10 stefan.baltzer 2009-11-27 15:44:49 UTC
SBA: Reassigned to TL.
Comment 11 oooforum (fr) 2015-08-10 08:11:02 UTC
*** Issue 126252 has been marked as a duplicate of this issue. ***