Issue 78127 - Inaccurate values in "Word Count" when a document includes surrogate characters
Status: CONFIRMED
Alias: None
Product: Writer
Classification: Application
Component: code
Version: OOo 2.2.1 RC2
Hardware: PC All
Importance: P3 Trivial with 1 vote
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: oooqa
: MoHu
Depends on:
Blocks:
 
Reported: 2007-06-06 03:32 UTC by kangjch
Modified: 2015-08-10 08:11 UTC
CC List: 10 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Description kangjch 2007-06-06 03:32:26 UTC
Steps for reproducing:
   1. Create a new document in Writer.
   2. Choose Insert -> Special Characters. In the Special Character dialog,
      select Font: "宋体-方正超大字符集",
      select the character U+20000 (Unicode code point) with the mouse,
      then press the "OK" button.
   3. Open Tools -> Word Count.

Actual Results:

        Whole documents ----------------
            Words:        0
            Characters:   2 

Desired Results: 

        Whole documents ----------------
            Words:        1
            Characters:   1
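
Why the count comes out as 2: U+20000 lies outside the Basic Multilingual Plane, so in UTF-16 it is stored as a surrogate pair of two 16-bit code units. The following is a minimal standalone sketch of that encoding step, not code from OOo; the function name encodeUtf16 is hypothetical and std::u16string stands in for the internal string type.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (<= 0x10FFFF and not itself
// in the surrogate range 0xD800..0xDFFF); no validation is done here.
std::u16string encodeUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // BMP character: one code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary character: split the 20-bit offset into
        // a high surrogate (top 10 bits) and a low surrogate (low 10 bits).
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

For U+20000 this gives the two code units 0xD840 0xDC00, so a counter that reports the string length in code units sees 2 where the user sees 1 character.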
Comment 1 michael.ruess 2007-06-06 09:21:41 UTC
Reassigned to SBA.
Comment 2 Stephan Bergmann 2007-06-21 09:35:49 UTC
@er, khong:  kangjch would like to investigate this issue, but is unsure
where to look.  Does anybody know whether this word counting is done with ICU
functionality?
Comment 3 ooo 2007-06-21 13:34:27 UTC
The word count itself is Writer functionality; it should use the i18n break
iterator. If so, the implementation is under i18npool/source/breakiterator/, for
Chinese specifically in i18npool/source/breakiterator/breakiterator_cjk.cxx,
which uses a dictionary approach to determine words, see
i18npool/source/breakiterator/data/zh.dic
Comment 4 Mathias_Bauer 2007-12-04 12:37:03 UTC
following release status meeting -> target 3.x
Comment 5 stefan.baltzer 2008-10-29 14:17:31 UTC
SBA: This issue has a target set but is still in the "Unconfirmed" state.
Please re-check with OOo 3.0 or newer whether it is (still) valid.
Then confirm it or set an appropriate resolution.
Thank you.
Comment 6 stefan.baltzer 2008-10-29 14:24:37 UTC
SBA: This issue has a target set but is still in the "Unconfirmed" state.
Please re-check with OOo 3.0 or newer whether it is (still) valid.
Then confirm it or set an appropriate resolution.
Thank you.
Comment 7 lohmaier 2009-09-28 23:09:34 UTC
confirming with OOo 3.1.1

Inserting U+20000 into an empty document results in:

WordCount: 1       (as expected, improvement compared to the initial report)
CharacterCount: 2  (should be 1)

(instead of U+20000 one could use U+20027, which has a glyph representation in
the code2001 font)

OOo's cursor movement treats the characters as one single character, i.e. one
keypress of left/right is enough to go past the character.

PS: easiest to reproduce with GTK's Unicode input method: <ctrl>+<shift>+u, then
the character code, then <enter>  (or keep ctrl+shift+u pressed while entering the
code, then release)
Comment 8 karl.hong 2009-09-29 01:28:58 UTC
Word count is handled by the i18n word break iterator. Chinese surrogate characters cannot currently be 
processed by OOo's dictionary-based Chinese word break iterator, so they fall back to the ICU break 
iterator, which should count one word per character. As tested by cloph, it counts the word correctly.

Character count is handled by Writer itself. I don't think it calls the character break iterator, which can find 
the correct character boundary for a surrogate pair. 
Comment 9 ooo 2009-09-29 10:52:08 UTC
To whom it may concern: to iterate over and count characters in the internal
UTF-16 encoding use OUString::iterateCodePoints().
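
A rough illustration of what counting by code point means, as opposed to counting UTF-16 code units. This is a standalone sketch and not the OUString::iterateCodePoints() API itself: std::u16string stands in for the internal string type, and the name countCodePoints is hypothetical. It assumes that an unpaired surrogate should be counted as one character.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-16 string.
// A high surrogate (0xD800..0xDBFF) immediately followed by a low
// surrogate (0xDC00..0xDFFF) is counted once. Assumption: an unpaired
// surrogate counts as one character on its own.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool isHigh = c >= 0xD800 && c <= 0xDBFF;
        if (isHigh && i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low surrogate, it belongs to this code point
        }
        ++count;
    }
    return count;
}
```

With a string holding only U+20000, size() reports 2 code units while countCodePoints reports 1, which matches the "Characters: 1" value requested in the description.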
Comment 10 stefan.baltzer 2009-11-27 15:44:49 UTC
SBA: Reassigned to TL.
Comment 11 oooforum (fr) 2015-08-10 08:11:02 UTC
*** Issue 126252 has been marked as a duplicate of this issue. ***