Apache OpenOffice (AOO) Bugzilla – Issue 78127
Inaccurate values in "Word Count" when a document includes surrogate characters
Last modified: 2015-08-10 08:11:02 UTC
Steps for reproducing:
1. Create a new document in Writer.
2. Choose Insert -> Special Characters. In the Special Character dialog, select the font "宋体-方正超大字符集", select the character U+20000 (Unicode code point) with the mouse, then press the "OK" button.
3. Open Tools -> Word Count. The Word Count dialog shows:
Whole document
----------------
Words: 0
Characters: 2
Desired results:
Whole document
----------------
Words: 1
Characters: 1
Reassigned to SBA.
@er, khong: kangjch would like to investigate this issue, but is unsure where to look. Does anybody know whether the word counting is done with ICU functionality?
The word count itself is Writer functionality; it should use the i18n break iterator. If so, the implementation is under i18npool/source/breakiterator/, and for Chinese specifically in i18npool/source/breakiterator/breakiterator_cjk.cxx, which uses a dictionary approach to determine words; see i18npool/source/breakiterator/data/zh.dic
following release status meeting -> target 3.x
SBA: This issue has a target set but is still in state of "Unconfirmed". Please re-check with OOo 3.0 or younger if it is (still) valid. Then confirm it or set an appropriate resolution. Thank you.
Confirming with OOo 3.1.1. Inserting U+20000 into an empty document results in:
WordCount: 1 (as expected; an improvement compared to the initial report)
CharacterCount: 2 (should be 1)
(Instead of U+20000 one could use U+20027, which has a glyph representation in the Code2001 font.)
OOo's cursor movement treats the character as one single character, i.e. one keypress of left/right is enough to move past it.
PS: This is easiest to reproduce with GTK's Unicode input method: <ctrl>+<shift>+u, then the character code, then <enter> (or keep ctrl+shift+u pressed while entering the code, then release).
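For readers wondering where the count of 2 comes from: a code point above U+FFFF is stored as two 16-bit code units (a surrogate pair) in UTF-16. The following is a minimal standalone sketch of that encoding in C++. The function name encodeUtf16 is hypothetical and is not OOo code; it only illustrates the arithmetic defined by UTF-16.

```cpp
#include <string>

// Hypothetical helper (not part of OOo): encode one Unicode code point
// as UTF-16. Code points above U+FFFF become a surrogate pair.
// Assumes cp is a valid scalar value (not itself a surrogate, <= U+10FFFF).
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        // Basic Multilingual Plane: a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        // Supplementary plane: split the 20-bit offset into two halves.
        const char32_t offset = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    }
    return out;
}
```

With this sketch, U+20000 encodes to the two code units 0xD840 0xDC00, so a counter that counts code units reports 2 for this single character.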
Word count is handled by the i18n word break iterator. Chinese surrogate characters currently cannot be processed by OOo's dictionary-based Chinese word break iterator, so they fall back to the ICU break iterator, which should count one word per character. As tested by cloph, this part now works correctly. Character count is handled by Writer itself; I don't think it calls the character break iterator, which can find the correct character boundary for a surrogate pair.
To whom it may concern: to iterate over and count characters in the internal UTF-16 encoding use OUString::iterateCodePoints().
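As an illustration of what counting by code point means, here is a minimal standalone C++ sketch. It does not use OUString, since that requires the OOo SAL headers; countCodePoints is a hypothetical helper operating on std::u16string, and stepping through an OUString with iterateCodePoints() would be expected to visit the same boundaries.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of OOo): count Unicode code points in a
// UTF-16 string. A high surrogate followed by a low surrogate counts once;
// an unpaired surrogate is counted as one code point on its own.
std::size_t countCodePoints(const std::u16string& text)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char16_t unit = text[i];
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < text.size())
        {
            const char16_t next = text[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF)
                ++i; // skip the low surrogate of a valid pair
        }
        ++count;
    }
    return count;
}
```

For the string consisting of 0xD840 0xDC00 (U+20000), size() returns 2 while countCodePoints returns 1, which matches the desired character count in this issue.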
SBA: Reassigned to TL.
*** Issue 126252 has been marked as a duplicate of this issue. ***