Issue 80815 - Word count differs from MS Word
Summary: Word count differs from MS Word
Status: CLOSED FIXED
Alias: None
Product: Writer
Classification: Application
Component: code
Version: OOo 2.3
Hardware: All All
Importance: P3 Trivial
Target Milestone: ---
Assignee: stefan.baltzer
QA Contact: issues@sw
URL:
Keywords: performance, usability
Depends on:
Blocks:
 
Reported: 2007-08-18 20:04 UTC by foobard
Modified: 2013-08-07 14:43 UTC
CC List: 9 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
test data for word count issue (19.50 KB, application/msword)
2008-03-11 18:06 UTC, foobard
Test results (18.64 KB, application/vnd.oasis.opendocument.spreadsheet)
2008-05-01 15:03 UTC, troodon

Description foobard 2007-08-18 20:04:31 UTC
OOo's word count feature is the only stumbling block in my organization to the
full adoption of OpenOffice in place of Microsoft Word -- and it is a HUGE barrier.

I am a professional writer. I get paid by the word, and I write to very tight
word counts (this is print media, not online). It is IMPERATIVE that my word
count is exactly the same as what my editor gets in Microsoft Word.

I work in an organization with more than three hundred writers, but until OOo
supports MS Word-style word count, we CANNOT change!

So yeah, OOo counts words correctly. Whatever. I don't care! Give me a tick box
in an options screen that says "Count Words like Microsoft Word Does" and let
those of us who need it -- have it!

Thank you!!!
Comment 1 shaunmcdonald131 2007-08-19 11:03:03 UTC
foobard: Can you please explain what the difference between the Microsoft Word and OpenOffice.org 
Writer word counts is? Examples are good. Without this information we are unable to fix the problem. 
Could the difference be something to do with headings?
Comment 2 foobard 2007-08-19 17:13:19 UTC
This is a well-known "feature", errr, "bug" in OOo. OOo counts hyphenated words
as two words. MS Word counts them as one word. There are several other
fundamental differences in how they count words. I never use headings in my
particular work, but in general I find OOo produces a word count on average
between 5% and 10% more than MS Word.

Even a cursory test of whatever document you have lying around will make it
clear to you the differences in word count.

I am not concerned with who counts "right" -- all I know is I need word count
compatibility to integrate OOo into my organization. Without this OOo is dead in
the water as far as management is concerned.

Thank you
foobard
Comment 3 michael.ruess 2007-08-20 09:35:45 UTC
A kind of "compatibility mode" for word count - or even a new statistics property...
A text like "This is toll-free use" will be stated as five words on OO Writer
and four words in MS Word.
Comment 4 foobard 2007-08-20 21:30:42 UTC
Another common situation in which MS Word and OpenOffice differ is in counting
special characters in words.

For instance, in my work I frequently use backslash escape codes for accents.
(This is necessary to make the back-end publishing software work properly.)

Examples of this would include fo\'sforos, archipi\'elago, do\^me, f**k, etc.
Each of these is counted as two words in OOo but only one word in MS Word.
Comment 5 Mathias_Bauer 2008-03-09 18:09:26 UTC
This is *not* a P1. Please learn how priorities are defined here before you
change them. Thank you.

I also don't think that this is a defect, because that would mean that the way Word
does it is a kind of standard others have to follow, which surely isn't true. But
OTOH it doesn't matter, so be it.
Comment 6 foobard 2008-03-09 18:32:32 UTC
This most certainly is a P1 for me and the tens of thousands of other
professional journalists who would gladly start using OpenOffice.org but can't
because the OOo word count isn't compatible with MS Word!

Look -- it sucks that Microsoft Word has a near monopoly on word processing
software. But whether you like it or not, Microsoft Word word count is a de
facto industry standard in the publishing industry. We are talking about
thousands and thousands and THOUSANDS of potential OOo users you will never
reach because you do not cater to our needs!

I have seen forum postings going back to 2003 complaining that OpenOffice
refuses to have a MS Word compatible word count. When are you guys going to
finally address this issue??
Comment 7 Mathias_Bauer 2008-03-09 18:52:46 UTC
The personal preference for an issue can't justify the priority. If we followed
that, *every* issue would be a P1. So we have rules about priorities. Everybody
should respect that.
Comment 8 frank.meies 2008-03-10 08:29:37 UTC
fme->fl: Do we really need a compatibility option for the word count? If not,
this shouldn't be too hard to fix. Karl would have to change the break iterator
mode WORD_COUNT. This can be done without any regression risk concerning other
functionality because this mode is solely used for counting words.

fme->khong:  What do you think?

fme->foobard: What would really help is a detailed list of the ways in which the OOo
word count differs from Word's. So right now we know the hyphen and the
special character issue. Any others?

Comment 9 frank.loehmann 2008-03-10 09:37:28 UTC
No new option needed. If the industry uses that behavior as a standard, we
should support it by default. The clearer the input on how to count is, the
better our implementation will match this standard.
Comment 10 stefan.baltzer 2008-03-10 14:17:34 UTC
Statistically speaking, MS Word is a standard, yes. So this can be regarded and
solved as a defect.

I believe that "in a perfect world" the word count feature should have more
options than a "Compatibility flag" or, even simpler, a change to "Word's
behavior". The handling of dots and slashes in URLs, and whether "this or that
character or special character" is a separator, a word in itself, or
something "that binds words into one", is likely to differ for
different purposes and especially in different languages. In writing this, I am
thinking of Complex Text Layout languages like Khmer, Nepali, Arabic, Thai...

As a hint at how a UI-driven "word count tuning" could be realized, the so-called
"forbidden characters" in CJK languages already have such a "list" that can be
edited by the user. To access this, one has to enable CJK support first:
 - Tools-Options-Language Settings-Languages -> Check "CJK support", OK
 - Tools-Options-Language Settings-Asian Layout:
Here you can see the "forbidden character" lists for the CJK languages, which can
be "standard" or "user-defined".
Comment 11 foobard 2008-03-10 18:21:33 UTC
fme->foobard: What would really help is a detailed list of the ways in which the OOo
word count differs from Word's. So right now we know the hyphen and the
special character issue. Any others?

fme -- 

well I can only speak from my own personal experience, using an English-language
version of OpenOffice.org. Here are some key test cases from my end:


battery-driven
f**k
and/or
apple(s)
money+opportunity
Micro$oft


each of these strings is counted as two words in OOo and one word in Microsoft
Word. If you can get test cases such as these to count as one word, we will be
90% of the way there. (After that I have large, complex documents that can help
furnish additional test cases if needed.)

Thank you so much for paying attention to this problem, and I do hope you can
fix it!
Comment 12 Mathias_Bauer 2008-03-10 18:34:11 UTC
Maybe the only word delimiters Word knows are whitespace, full stop, comma and
the like. This could be a start for a new implementation that could be used as a
base for further tests.

So how does Word treat "300$" "I(not you)" "a****n" "1+3=4" ?
Comment 13 foobard 2008-03-10 18:54:14 UTC
mba said:
So how does Word treat "300$" "I(not you)" "a****n" "1+3=4" ?


it appears to be a space-delimited issue. All of the above are one word in
Microsoft Word except for "I(not you)".

Comment 14 karl.hong 2008-03-10 19:10:13 UTC
Yes, I could change the breakiterator rule to count words in a different way. Give
me an example.
Comment 15 foobard 2008-03-10 23:22:00 UTC
doing some further testing, it looks like this could be as easy as white space
as the only delimiter -- strings like "aaaaaaa.aaaaaaa" or "aaaaaaa,aaaaaaa" or
even "aaaaaaa;aaaaaaa" are still counted as one single word in MS Word.
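
The observations collected so far can be checked mechanically. The following is a minimal Python sketch, not OOo code. It assumes, as suggested above, that white space is the only delimiter. The names `count_words_whitespace`, `REPORTED_ONE_WORD` and `mismatches` are made up for illustration, and the table contains only strings reported in this issue as one word in Word 2003.

```python
# Strings reported in this issue as ONE word in MS Word 2003.
REPORTED_ONE_WORD = [
    "battery-driven", "f**k", "and/or", "apple(s)", "money+opportunity",
    "Micro$oft", "300$", "a****n", "1+3=4",
    "aaaaaaa.aaaaaaa", "aaaaaaa,aaaaaaa", "aaaaaaa;aaaaaaa",
]


def count_words_whitespace(text: str) -> int:
    """Count words, assuming white space is the only delimiter."""
    return len(text.split())


def mismatches(counter):
    """Return the reported strings that `counter` does not count as one word."""
    return [s for s in REPORTED_ONE_WORD if counter(s) != 1]


if __name__ == "__main__":
    print(mismatches(count_words_whitespace))
    print(count_words_whitespace("I(not you)"))
```

Under this assumption "I(not you)" splits into two tokens, which agrees with the report that it is not one word. The thread does not state the exact count Word gives for it.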
Comment 16 frank.meies 2008-03-11 08:04:36 UTC
fme->foobard: Which version of word did you use for testing? Are there
differences between the various versions? It would be fine if you could attach a
doc file which lists all the mentioned examples.
Comment 17 foobard 2008-03-11 18:02:30 UTC
foobard->fme: To the best of my knowledge (which is solely based on my
experience as an end user of MS Word for the last decade or so) Microsoft has
not varied the way it counts words. The version I am using is Word 2003. I don't
see where or how to attach a document to this issue, so I'm sending the .doc
file to you via email now. cheers!
Comment 18 foobard 2008-03-11 18:06:44 UTC
Created attachment 52036
test data for word count issue
Comment 19 karl.hong 2008-03-15 02:45:25 UTC
If no one objects, I would grab the issue and fix it for 3.0 beta.
Comment 20 karl.hong 2008-03-15 03:11:58 UTC
I have changed the rule; punctuation is now counted as part of a word.

I have one question. MS counts pure punctuation as a word, while we define that a word
should contain at least one letter or number.

Example: 'who need it -- have it!', MS counts 6 words, we count 5 words.

Is that ok, or should we count punctuation too?
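
To make the two definitions concrete, here is a small Python sketch. It is not the i18npool rule itself; it assumes whitespace-delimited tokens, and `count_all_tokens` and `count_tokens_with_alnum` are hypothetical names.

```python
def count_all_tokens(text: str) -> int:
    """Every whitespace-delimited token counts, even pure punctuation."""
    return len(text.split())


def count_tokens_with_alnum(text: str) -> int:
    """Only tokens containing at least one letter or digit count as words."""
    return sum(1 for tok in text.split() if any(ch.isalnum() for ch in tok))


sample = "who need it -- have it!"
print(count_all_tokens(sample), count_tokens_with_alnum(sample))
```

For the sample sentence the first function gives the 6 reported for MS Word and the second gives the 5 reported for the changed OOo rule; the only token they disagree on is the stand-alone "--".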
Comment 21 foobard 2008-03-15 03:39:46 UTC
foobard->khong: thanks for working on this!

From an end user point of view, what is desired is for OpenOffice.org and
Microsoft Word to count words exactly the same. Ideally, all test documents
should count exactly the same in both programs.

I realize it is a pain letting Microsoft set industry "standards" such as this,
but I'm afraid for those of us who rely on word count for our work, it's really
essential they match up exactly.

thanks again!
Comment 22 frank.meies 2008-03-15 06:57:14 UTC
fme->khong: I agree with foobard. If we change our word count to match the MS
word count, we shouldn't stop half way. One more question: The break iterator
code changes only affect the WORD_COUNT rule, is this correct?
Comment 23 karl.hong 2008-03-15 08:26:16 UTC
khong->fme, Yes, the change only applies to word count.

Ok, the punctuation I mentioned before is eaten by Writer. Here is my testing
program,

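Sub Main
    ' Create the i18n break iterator service
    bi = createUnoService("com.sun.star.i18n.BreakIterator")
    Dim locale As New com.sun.star.lang.Locale
    locale.Language = "en"
    ' WORD_COUNT is the word type used only for counting words
    wType = com.sun.star.i18n.WordType.WORD_COUNT

    aStr = "this -- is"

    ' Ask for the boundary of the word following position 1
    boundary = bi.nextWord(aStr, 1, locale, wType)
    Print boundary.startPos, boundary.endPos
End Sub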

It prints "5, 7", so '--' is a word, but Writer counts 2 words for the string; it
should count 3 words.

fme, could you take a look? My changes for i18npool are in cws i18n40. Thanks.
Comment 24 frank.meies 2008-03-15 10:32:06 UTC
fme->khong: You are right. The SwScanner skips 'words' that do not start with
letters. Please add the sw project to your cws.
Comment 25 karl.hong 2008-03-15 19:50:50 UTC
khong->fme, project sw is now in cws i18n40, could you work on it? 
Comment 26 frank.meies 2008-03-18 09:59:07 UTC
fme->khong: I committed my changes, see sw/source/core/txtnode/txtedt.cxx.
Please verify that everything works correctly.
Comment 27 karl.hong 2008-03-18 23:45:17 UTC
There is one case that is not easy to implement. 

"This--is" and "--help" are 2 words in MS Word, while "This-is", "-help", "----"
and "this--" are 1 word.  It seems only for dash, other punctuations are not
like this.

We treat all cases as 1 word in current implementation, and it is consistent for
all punctuations.

If that is acceptable, I can commit this cws, otherwise I have to twist the
rule, and I am not sure how difficult it will be.
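
For illustration only, one rule that reproduces the six observations above can be sketched in Python. This is a guess fitted to those six strings, not the rule Word uses and not the rule in cws i18n40: within a whitespace-delimited token, a run of exactly two dashes that is followed by a non-dash character ends a word. The name `count_words_dash_rule` is hypothetical.

```python
import re

# "--" that is not preceded by a dash and is followed by a non-dash character.
# Assumption: this pattern is fitted to the six strings reported in this issue.
_DOUBLE_DASH_BREAK = re.compile(r"(?<!-)--(?=[^-])")


def count_words_dash_rule(text: str) -> int:
    """Whitespace-delimited count with an extra break after a qualifying '--'."""
    total = 0
    for token in text.split():
        # Each token is one word, plus one more per qualifying double dash.
        total += 1 + len(_DOUBLE_DASH_BREAK.findall(token))
    return total


for s in ["This--is", "--help", "This-is", "-help", "----", "this--"]:
    print(s, count_words_dash_rule(s))
```

Strings outside the six observed cases, such as three dashes between two words, are not covered by any observation in this issue, so the sketch's result for them should not be read as Word's behavior.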
Comment 28 foobard 2008-03-19 00:01:17 UTC
foobard->khong

I am sorry this is difficult to implement.

However, either OpenOffice.org counts words *exactly* like MS Word, or it is
completely useless to those of us who need this feature.

Are you sure it would not be possible to implement this peculiar "--" case?


Comment 29 karl.hong 2008-03-19 01:54:59 UTC
khong->foobard, theoretically we cannot match MS Word 100% or *exactly*,
since we are reverse engineering a black box, not implementing from a spec.

Even if we had their implementation spec, this case may be their bug. I don't have
a good explanation for why 2 dashes are a word separator while 1 dash or 2 "+" are
not, or why 2 dashes are a word when they appear in front of a word but not when
they are appended to a word.

I found this case because I copied and pasted a Unix man page for testing; it has 2
dashes for command line options.

And I believe there is still something we don't know about their implementation.

So the point is how important this case is and how frequently it will
appear in a normal document.

I can try to implement it, but I don't want to spend time implementing their
bug. Could you give me a good explanation?
Comment 30 foobard 2008-03-19 02:16:05 UTC
foobard->khong

I'm a freelance journalist. About half of the time I get paid by the word. Or
rather, I get paid by the word count as counted in Microsoft Word.

Further, I work to fixed word counts. If my editor says, 20,000 words, and I
submit a manuscript that's 20,005, it makes me look like an idiot. It would be
like a programmer forgetting to include a matching curly brace -- it's so basic
that if you can't do it right, you look very unprofessional.

This issue affects all journalists and writers, for whom word count is the
fundamental basis of their business. As I've written before, there are tens of
thousands of us who would gladly switch to OOo if they could. Freelance
journalists as a rule don't make a lot of money -- a lot less than programmers
anyway -- so we are an ideal target audience to expand the OOo user base.

I realize that you are attempting to reverse engineer in a black box situation.
As a one-time programmer myself, I can appreciate the difficulty of such a task,
and I thank you for making a start. But unless you finish the task, unless we
make every effort to make OOo word count compatible with MS Word -- including
any bugs that might exist -- then all the work you've put into this issue is of
no use to me nor to any of the other writers out there.

In "Writer Land", we live and die by word count. It is the measure of all
things, and the final arbiter of all disagreements. It is law. And it is law as
dictated by Microsoft.

Please consider implementing this Microsoft bug.
Comment 31 karl.hong 2008-03-19 07:28:01 UTC
Ok, that is done, dash is implemented as MS style.

Next difference (the exploration of the black box will never end) is for
Writer.

khong->fme.

MS counts bullet/numbering as 1 word, we don't count it. 

To reproduce, in MS Word, click on bullet, the word count is 0, type a word,
word count is 2. But OOo counts as 1. 

This must be implemented in Writer, which has to pass bullet to breakiterator.
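
The reported bullet behavior can be modeled with a short Python sketch. `Paragraph` and `count_with_labels` are hypothetical and unrelated to Writer's classes; the sketch assumes whitespace-delimited words and encodes only the two observations above: an empty bulleted paragraph counts 0, and a bulleted paragraph with one typed word counts 2.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Hypothetical paragraph model, not a Writer class."""
    text: str
    has_label: bool = False  # True if the paragraph has a bullet or number


def count_with_labels(paragraphs) -> int:
    """Count words, adding one for the label of each non-empty labeled paragraph."""
    total = 0
    for p in paragraphs:
        words = len(p.text.split())
        if p.has_label and words > 0:
            words += 1  # the bullet/number itself counts as a word
        total += words
    return total


print(count_with_labels([Paragraph("", True)]))
print(count_with_labels([Paragraph("word", True)]))
```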
Comment 32 frank.meies 2008-03-19 12:38:22 UTC
fme->khong, foobard: This is harder than expected. Fields have to be expanded as
well. For this I changed some more Writer code. Please verify.

fme->QA: Please test Word count thoroughly. Fields are now expanded before
counting, numbering/bullets are also counted.
Comment 33 karl.hong 2008-03-20 03:13:41 UTC
Ready for QA.
Comment 34 karl.hong 2008-03-20 03:15:24 UTC
.
Comment 35 foobard 2008-03-21 22:35:23 UTC
something for QA -- 

I note in issue 17964 in comment by sajer Sat Jan 31 the following:

"It is no coincidence that Microsoft Word XP enhanced its 
word count feature, thats simply because people needed it and asked 
for more!"

It is possible that Word XP's word count differs from previous versions. This
might be worth double checking.

also note comment by miller_dscott Thu Mar 25:

"Many Asian languages are not counted in words, but in characters. In order for
the word count tool to be useful with Asian languages, it needs to be able to
distinguish between Asian and non-Asian characters and produce independent
counts for Asian characters and non-Asian words.
 It's also important that the "Asian character count" not include
Asian/double-byte spaces, or, that it show both Asian character total with
spaces and Asian character total without spaces."

erikanderson3 Thu May 6 08:59:33 +0000 2004 also chimes in with:

"As another Japanese -> English translator, I would like to second
miller_dscott's comments.  A word/character count function that distinguishes
between Asian and non-Asian text is vital.  As mentioned in a post over on the
OpenOffice.org Forum
(http://www.oooforum.org/forum/viewtopic.php?p=23214#23214), I currently get
rather silly results with OOo.  A sample paragraph just pasted in from the front
page of http://www.nikkei.com shows 135 Asian characters in Word, and 78 'words'
in Writer.  Again, the very concept of 'word' is irrelevant for counting
Japanese, as everyone goes by character count, not including spaces.  I
understand it's similar for Chinese."

As I don't use OOo with CJK support I can't comment on this.


hth
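
The separate counts requested in the quoted comments could look roughly like the following Python sketch. It is not OOo or Word behavior. The code point ranges are a deliberately small assumption (Hiragana, Katakana, CJK Unified Ideographs, Hangul syllables), characters in the CJK Symbols and Punctuation block, including the ideographic space, are treated as uncounted separators, and `is_asian_char`, `is_asian_separator` and `count_mixed` are hypothetical names.

```python
def is_asian_char(ch: str) -> bool:
    """Rough test for a countable Asian character (assumed ranges only)."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x30FF      # Hiragana and Katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def is_asian_separator(ch: str) -> bool:
    """CJK Symbols and Punctuation, including the ideographic space U+3000."""
    return 0x3000 <= ord(ch) <= 0x303F


def count_mixed(text: str):
    """Return (Asian character count, non-Asian word count)."""
    asian = sum(1 for ch in text if is_asian_char(ch))
    # Blank out Asian characters and separators, then count what is left
    # as whitespace-delimited words.
    rest = "".join(
        " " if is_asian_char(ch) or is_asian_separator(ch) else ch
        for ch in text
    )
    return asian, len(rest.split())


print(count_mixed("日本語 text と OOo"))
```

The Asian character count here excludes spaces, matching the "not including spaces" convention described in the quoted comments.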
Comment 36 karl.hong 2008-03-21 23:03:00 UTC
We only fixed word count for Western languages in this issue.

For CJK and CTL, we need to investigate native language versions of MS Word;
unlike OOo, MS may have different implementations in native language versions.
Comment 37 stefan.baltzer 2008-03-27 14:14:40 UTC
Adjusted summary. Verified in CWS i18n40.
Comment 38 foobard 2008-04-20 20:21:44 UTC
Has this fix been included in the most recent 3.0 dev snapshot?

I just downloaded the latest .deb, installed it on Ubuntu, opened up a 21,660
word document that 3.0 tells me only contains 910 words.
Comment 39 foobard 2008-04-20 22:01:23 UTC
update: the word count feature is absolutely spot on in the Windows version of
OOo 3.0. It would appear the weirdness is only in the Linux .deb version. (I
haven't tested on a Mac, don't have access to one.)

thanks for fixing this problem!
Comment 40 Mathias_Bauer 2008-04-20 22:17:36 UTC
EIS tells me: integrated in m6.
If you have used this or a newer version: please attach a sample document. A
difference of this size sounds strange.
Comment 41 troodon 2008-04-20 22:31:46 UTC
->foobard

I think you're seeing the bug described in issue 88484. Maybe that's a
regression introduced by the changes for this issue.
Comment 42 foobard 2008-04-20 22:39:01 UTC
I'm afraid I can't attach the document in question, as it's covered by an NDA.

However, according to Properties->Statistics, the number of paragraphs is 667
and the number of words 682. I think troodon may be right.
Comment 43 troodon 2008-04-21 00:07:10 UTC
I've discovered a little problem with bullet/numbering counting. See issue 88509.
Comment 44 troodon 2008-04-22 14:48:20 UTC
Tested with four Word documents, and the word count DEV300 m9 gives (on Windows, so not
affected by issue 88484) *never* matches the word count given by Word 2000.
In my test OOo always counted more words than Word, the smallest delta being +8 and
the biggest one +162.

Later I'll try with more files and I'll attach a Calc sheet with the results.
Should I open another issue or should I post here?

I can't publicly post the four files I tested with, but I could send them to a
developer.
Comment 45 foobard 2008-04-22 17:03:35 UTC
-> troodon

if you like, send them to me. I'm testing with MS Office 2003, perhaps there are
differences between how 2000 and 2003 count words. If nothing else, I can check
and see if I get the same results you do.
Comment 46 foobard 2008-04-22 17:06:15 UTC
Also, has this fix addressed bug 86537?

I don't use footnotes and headers and footers so I'm not sure.
Comment 47 troodon 2008-05-01 15:01:44 UTC
I've compiled a sheet with 17 publicly available files. From those 17 files,
DEV300_m9 only gave the same word count as Word 2000 for two files (one txt file
and one Word file), so I consider this issue isn't fixed.
Comment 48 troodon 2008-05-01 15:03:09 UTC
Created attachment 53301
Test results
Comment 49 Mathias_Bauer 2008-05-02 15:14:21 UTC
Thanks for the test data. We will have a look on it.
fme: please take over if time permits.
Comment 50 frank.meies 2008-05-05 08:34:01 UTC
fme->troodon/foobard: Please open a new issue in case that there are still some
problems with the word count or differences to the MS word count. It would be
very helpful if the two of you could provide us with a detailed description of
what kind of words are actually causing the differences.
Comment 51 foobard 2008-05-05 20:32:41 UTC
Working through troodon's spreadsheet, I had a look at the most egregious
example,
http://www.nursesreg.health.nsw.gov.au/assets/healthamms1/policies_procedures_sec1.doc.

Looks like complex tables of contents created in MS Word just don't render in
OOo, and since the words don't "exist", they can't be counted either.

I've reported this as bug 89038.

I tested in the latest beta but that wasn't an option in the dropdown menu.
Comment 52 foobard 2008-05-05 20:46:31 UTC
also bug 89040. Looks like OOo is not counting MS Word created form fields (and
maybe table content) correctly.

Comment 53 foobard 2008-05-05 20:59:06 UTC
also appears to be a problem with text in MS Word-created text boxes, see bug 89041.
Comment 54 foobard 2008-05-05 21:25:56 UTC
see also bug 89042

the test case

(a ‘salvage function’)

counts as three words in MS Word but four in the latest OOo beta. 
Comment 55 stefan.baltzer 2008-07-15 18:39:20 UTC
SBA: The attached test_data.doc has 24 words in MS Word 2007 and 24 words in OOo
DEV300_m24. Issue re-verified in master build and closed.

Please comment on "other issues" in the respective issue. :-)