78127 – Inaccurate values in "Word Count" when a document includes surrogate characters

Status:	CONFIRMED

Alias:	None

Product:	Writer
Classification:	Application
Component:	code
Version:	OOo 2.2.1 RC2
Hardware:	PC All

Importance:	P3 Trivial with 1 vote
Target Milestone:	---
Assignee:	AOO issues mailing list
QA Contact:

URL:
Keywords:	oooqa

Duplicates (1):	MoHu
Depends on:
Blocks:

Reported:	2007-06-06 03:32 UTC by kangjch
Modified:	2015-08-10 08:11 UTC
CC List:	10 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	---
Developer Difficulty:	---

Description kangjch 2007-06-06 03:32:26 UTC

Steps to reproduce:
   1. Create a new document in Writer.
   2. Choose Insert -> Special Characters. In the Special Characters dialog,
      select Font: "宋体-方正超大字符集",
      click the character U+20000 (Unicode code point),
      then press the "OK" button.
   3. Open Tools -> Word Count. The Word Count dialog shows:

        Whole documents ----------------
            Words:        0
            Characters:   2 

Desired Results: 

        Whole documents ----------------
            Words:        1
            Characters:   1
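
For context on why the dialog reports 2 characters: U+20000 lies outside the Basic Multilingual Plane, so UTF-16 stores it as a surrogate pair of two 16-bit code units. The following is a minimal standalone sketch of that encoding in standard C++; the function name encodeUtf16 is hypothetical and is not taken from the OOo sources.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of OOo): encode one Unicode code point
// as UTF-16 code units.
std::u16string encodeUtf16(char32_t cp)
{
    // Preconditions: a valid scalar value, i.e. not above U+10FFFF and
    // not itself a surrogate.
    assert(cp <= 0x10FFFF);
    assert(cp < 0xD800 || cp > 0xDFFF);

    std::u16string out;
    if (cp < 0x10000) {
        // BMP character: one code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary character: split the 20-bit offset into a
        // high (lead) and low (trail) surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```

With this sketch, encodeUtf16(0x20000) yields the two code units 0xD840 0xDC00, so a count based on UTF-16 string length gives 2 where a user expects 1.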

Comment 1 michael.ruess 2007-06-06 09:21:41 UTC

Reassigned to SBA.

Comment 2 Stephan Bergmann 2007-06-21 09:35:49 UTC

@er, khong:  kangjch would like to investigate this issue, but is unsure
where to look.  Does anybody know whether this word counting is done with ICU
functionality?

Comment 3 ooo 2007-06-21 13:34:27 UTC

The word count itself is a Writer functionality, it should use the i18n break
iterator. If so, implementation is under i18npool/source/breakiterator/, for
Chinese specifically in i18npool/source/breakiterator/breakiterator_cjk.cxx,
which uses a dictionary approach to determine words, see
i18npool/source/breakiterator/data/zh.dic

Comment 4 Mathias_Bauer 2007-12-04 12:37:03 UTC

following release status meeting -> target 3.x

Comment 5 stefan.baltzer 2008-10-29 14:17:31 UTC

SBA: This issue has a target set but is still in state of "Unconfirmed".
Please re-check with OOo 3.0 or later if it is (still) valid.
Then confirm it or set an appropriate resolution.
Thank you.

Comment 6 stefan.baltzer 2008-10-29 14:24:37 UTC

SBA: This issue has a target set but is still in state of "Unconfirmed".
Please re-check with OOo 3.0 or later if it is (still) valid.
Then confirm it or set an appropriate resolution.
Thank you.

Comment 7 lohmaier 2009-09-28 23:09:34 UTC

confirming with OOo 3.1.1

inserting u+20000 into an empty document results in:

WordCount: 1       (as expected, improvement compared to the initial report)
CharacterCount: 2  (should be 1)

(instead of u+20000 one could use u+20027 - that has a glyph representation in
the code2001 font)

OOo's cursor movement treats the characters as one single character, i.e. one
keypress of left/right is enough to go past the character.

PS: easiest to reproduce with gtk's unicode input method: <ctrl>+<shift>+u, then
charactercode, then <enter>  (or keep ctrl+shift+u pressed while entering the
code, then release)

Comment 8 karl.hong 2009-09-29 01:28:58 UTC

Word count is handled by the i18n word break iterator. Chinese surrogate
characters cannot currently be processed by OOo's dictionary-based Chinese word
break iterator, so they fall back to the ICU break iterator, which should count
one word per character. As tested by cloph, the word count for the character is
now correct.

Character count is handled by Writer itself; I don't think it calls the
character break iterator, which can find the correct character boundaries for a
surrogate pair.

Comment 9 ooo 2009-09-29 10:52:08 UTC

To whom it may concern: to iterate over and count characters in the internal
UTF-16 encoding use OUString::iterateCodePoints().
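
The idea behind counting by code point instead of by UTF-16 code unit can be sketched in standard C++ as follows. This is an illustration only, not the OUString implementation: the names nextCodePoint and countCodePoints are hypothetical, std::u16string stands in for OUString, and treating an unpaired surrogate as a single character is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the code point starting at s[index] and
// advance index past it (by 1 or 2 code units).
char32_t nextCodePoint(const std::u16string& s, std::size_t& index)
{
    assert(index < s.size());
    const char16_t first = s[index++];
    // A high surrogate followed by a low surrogate forms one code point.
    if (first >= 0xD800 && first <= 0xDBFF && index < s.size()) {
        const char16_t second = s[index];
        if (second >= 0xDC00 && second <= 0xDFFF) {
            ++index;
            return 0x10000
                + ((static_cast<char32_t>(first) - 0xD800) << 10)
                + (static_cast<char32_t>(second) - 0xDC00);
        }
    }
    // BMP character, or an unpaired surrogate returned as-is (assumption).
    return first;
}

// Hypothetical helper: count code points rather than UTF-16 code units.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size();) {
        nextCodePoint(s, i);
        ++count;
    }
    return count;
}
```

For a string holding only U+20000, the size() of the UTF-16 string is 2 while countCodePoints returns 1, which matches the desired "Characters: 1" from the description.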

Comment 10 stefan.baltzer 2009-11-27 15:44:49 UTC

SBA: Reassigned to TL.

Comment 11 oooforum (fr) 2015-08-10 08:11:02 UTC

*** Issue 126252 has been marked as a duplicate of this issue. ***