Apache OpenOffice (AOO) Bugzilla – Issue 80815
Word count differs from MS Word
Last modified: 2013-08-07 14:43:03 UTC
OOo's word count feature is the only stumbling block in my organization to the full adoption of OpenOffice in place of Microsoft Word -- and it is a HUGE barrier. I am a professional writer. I get paid by the word, and I write to very tight word counts (this is print media, not online). It is IMPERATIVE that my word count is exactly the same as what my editor gets in Microsoft Word. I work in an organization with more than three hundred writers, but until OOo supports MS Word-style word count, we CANNOT change! So yeah, OOo counts words correctly. Whatever. I don't care! Give me a tick box in an options screen that says "Count Words like Microsoft Word Does" and let those of us who need it -- have it! Thank you!!!
foobard: Can you please explain what the difference between the Microsoft Word and OpenOffice.org Writer word counts is? Examples are good. Without this information we are unable to fix the problem. Could the difference be something to do with headings?
This is a well-known "feature", errr, "bug" in OOo. OOo counts hyphenated words as two words. MS Word counts them as one word. There are several other fundamental differences in how they count words. I never use headings in my particular work, but in general I find OOo produces a word count on average between 5% and 10% higher than MS Word. Even a cursory test of whatever document you have lying around will make the differences in word count clear to you. I am not concerned with who counts "right" -- all I know is I need word count compatibility to integrate OOo into my organization. Without this OOo is dead in the water as far as management is concerned. Thank you foobard
A kind of "compatibility mode" for word count - or even a new statistics property... A text like "This is toll-free use" will be stated as five words on OO Writer and four words in MS Word.
Another common situation in which MS Word and OpenOffice differ is in counting special characters in words. For instance, in my work I frequently use backslash escape codes for accents. (This is necessary to make the back-end publishing software work properly.) Examples of this would include fo\'sforos, archipi\'elago, do\^me, f**k, etc. Each of these is counted as two words in OOo but only one word in MS Word.
This is *not* a P1. Please learn how priorities are defined here before you change them. Thank you. I also don't think that this is a defect, because that would mean that the way Word is doing it is a kind of standard others had to follow, which surely isn't true. But OTOH it doesn't matter, so be it.
This most certainly is a P1 for me and the tens of thousands of other professional journalists who would gladly start using OpenOffice.org but can't because the OOo word count isn't compatible with MS Word! Look -- it sucks that Microsoft Word has a near monopoly on word processing software. But whether you like it or not, Microsoft Word word count is a de facto industry standard in the publishing industry. We are talking about thousands and thousands and THOUSANDS of potential OOo users you will never reach because you do not cater to our needs! I have seen forum postings going back to 2003 complaining that OpenOffice refuses to have a MS Word compatible word count. When are you guys going to finally address this issue??
The personal preference for an issue can't justify the priority. If we followed that, *every* issue would be a P1. So we have rules about priorities. Everybody should respect that.
fme->fl: Do we really need a compatibility option for the word count? If not, this shouldn't be too hard to fix. Karl would have to change the break iterator mode WORD_COUNT. This can be done without any regression risk concerning other functionality because this mode is solely used for counting words. fme->khong: What do you think? fme->foobard: What would really help is a detailed list of the ways in which the OOo word count differs from Word's. So right now we know the hyphen and the special character issue. Any others?
No new option needed. If the industry uses that behavior as a standard, we should support it by default. The clearer the input on how to count is, the better our implementation will match this standard.
Statistically speaking, MS Word is a standard, yes. So this can be regarded and solved as a defect. I believe that "in a perfect world" the word count feature should have more options than a "compatibility flag" or, even simpler, a change to "Word's behavior". The handling of dots and slashes in URLs, and whether "this or that character or special character" is a separator, a word in itself, or something that "binds words into one", is likely to differ for different purposes and especially in different languages. In writing this, I am thinking about the Complex Text Layout languages like Khmer, Nepali, Arabic, Thai...
As a hint at how a UI-driven "word count tuning" could be realized: the so-called "forbidden characters" in CJK languages already have such a "list" that can be edited by the user. To access it, one has to enable CJK support first:
- Tools-Options-Language Settings-Languages -> Check "CJK support", OK
- Tools-Options-Language Settings-Asian Layout: Here you can see the "forbidden characters" lists for the CJK languages, which can be "standard" or "user-defined".
fme->foobard: What would really help is a detailed list in which ways the OOo word count differs from Word's one. So right now we know the hyphen and the special character issue. Any others?
fme -- well, I can only speak from my own personal experience, using an English-language version of OpenOffice.org. Here are some key test cases from my end:
battery-driven
f**k
and/or
apple(s)
money+opportunity
Micro$oft
Each of these strings is counted as two words in OOo and one word in Microsoft Word. If you can get test cases such as these to count as one word, we will be 90% of the way there. (After that I have large, complex documents that can help furnish additional test cases if needed.) Thank you so much for paying attention to this problem, and I do hope you can fix it!
Maybe the only word delimiters Word knows are whitespace, full stop, comma and the like. This could be a start for a new implementation that could be used as a base for further tests. So how does Word treat "300$" "I(not you)" "a****n" "1+3=4" ?
mba said: So how does Word treat "300$" "I(not you)" "a****n" "1+3=4" ? It appears to be a space-delimited issue. All of the above are one word in Microsoft Word except for "I(not you)".
Yes, I could change the breakiterator rule to count words in a different way. Give me an example.
doing some further testing, it looks like this could be as easy as white space as the only delimiter -- strings like "aaaaaaa.aaaaaaa" or "aaaaaaa,aaaaaaa" or even "aaaaaaa;aaaaaaa" are still counted as one single word in MS Word.
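The whitespace-only hypothesis from the tests above can be sketched in a few lines of Python. This is only an illustration of the observed behavior, not the OOo break iterator and not Word's actual algorithm; the function name count_words_whitespace is hypothetical, and the sample strings are the ones reported in this issue.

```python
def count_words_whitespace(text: str) -> int:
    """Count words by splitting on runs of whitespace only.

    Assumption: punctuation never separates words, so "and/or"
    or "aaaaaaa.aaaaaaa" each count as a single word.
    """
    return len(text.split())


# Test strings taken from the comments in this issue.
samples = [
    "battery-driven", "f**k", "and/or", "apple(s)",
    "money+opportunity", "Micro$oft", "300$", "a****n",
    "1+3=4", "aaaaaaa.aaaaaaa", "I(not you)",
    "This is toll-free use",
]

if __name__ == "__main__":
    for s in samples:
        print(repr(s), count_words_whitespace(s))
```

Under this rule "I(not you)" counts as two words only because it contains a space, which matches what was reported for Word.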
fme->foobard: Which version of Word did you use for testing? Are there differences between the various versions? It would be good if you could attach a doc file which lists all the mentioned examples.
foobard->fme: To the best of my knowledge (which is solely based on my experience as an end user of MS Word for the last decade or so) Microsoft has not varied the way it counts words. The version I am using is Word 2003. I don't see where or how to attach a document to this issue, so I'm sending the .doc file to you via email now. cheers!
Created attachment 52036 [details] test data for word count issue
If no one objects, I would grab the issue and fix it for 3.0 beta.
I have changed the rule; punctuation is now counted as part of a word. I have one question. MS counts pure punctuation as a word; we define that a word should contain at least one letter or number. For example, for 'who need it -- have it!', MS counts 6 words and we count 5 words. Is that OK, or should we count punctuation too?
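The two policies in question can be compared with a small Python sketch. It assumes whitespace-delimited tokens, as suggested by the earlier tests; both function names are hypothetical and neither is the actual i18npool rule.

```python
def count_all_tokens(text: str) -> int:
    """Every whitespace-delimited token is a word, even pure punctuation."""
    return len(text.split())


def count_alnum_tokens(text: str) -> int:
    """A token is a word only if it contains at least one letter or digit."""
    return sum(
        1 for token in text.split()
        if any(ch.isalnum() for ch in token)
    )


if __name__ == "__main__":
    sentence = "who need it -- have it!"
    # The standalone "--" is the only token the two policies disagree on.
    print(count_all_tokens(sentence), count_alnum_tokens(sentence))
```

For the example sentence the first policy gives 6 and the second gives 5, the same difference reported between MS Word and the changed rule.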
foobard->khong: thanks for working on this! From an end user point of view, what is desired is for OpenOffice.org and Microsoft Word to count words exactly the same. Ideally, all test documents should count exactly the same in both programs. I realize it is a pain letting Microsoft set industry "standards" such as this, but I'm afraid for those of us who rely on word count for our work, it's really essential they match up exactly. thanks again!
fme->khong: I agree with foobard. If we change our word count to match the MS word count, we shouldn't stop half way. One more question: The break iterator code changes only affect the WORD_COUNT rule, is this correct?
khong->fme: Yes, the change only applies to word count. OK, the punctuation I mentioned before is eaten by Writer. Here is my test program:

Sub Main
    bi = createUnoService("com.sun.star.i18n.BreakIterator")
    Dim locale As New com.sun.star.lang.Locale
    locale.Language = "en"
    wType = com.sun.star.i18n.WordType.WORD_COUNT
    aStr = "this -- is"
    ' Ask for the word that follows position 1
    boundary = bi.nextWord(aStr, 1, locale, wType)
    Print boundary.startPos, boundary.endPos
End Sub

It prints "5, 7", so '--' is a word. But Writer counts 2 words for the string; it should count 3 words. fme, could you take a look? My changes for i18npool are in cws i18n40. Thanks.
fme->khong: You are right. The SwScanner skips 'words' that do not start with letters. Please add the sw project to your cws.
khong->fme, project sw is now in cws i18n40, could you work on it?
fme->khong: I committed my changes, see sw/source/core/txtnode/txtedt.cxx. Please verify that everything works correctly.
There is one case that is not easy to implement. "This--is" and "--help" are 2 words in MS Word, while "This-is", "-help", "----" and "this--" are 1 word. This seems to apply only to dashes; other punctuation does not behave like this. The current implementation treats all these cases as 1 word, consistently for all punctuation. If that is acceptable, I can commit this cws; otherwise I have to twist the rule, and I am not sure how difficult that will be.
foobard->khong I am sorry this is difficult to implement. However, either OpenOffice.org counts words *exactly* like MS Word, or it is completely useless to those of us who need this feature. Are you sure it would not be possible to implement this peculiar "--" case?
khong->foobard: Theoretically we cannot match MS Word 100% or *exactly*, since we are reverse engineering a black box, not implementing from a spec. Even if we had their implementation spec, this case may be their bug. I don't have a good explanation for why 2 dashes are a word separator while 1 dash or 2 "+" are not, or why 2 dashes are a word when they appear in front of a word but not when they are appended to a word. I found this case because I copied and pasted a Unix man page for testing; it has 2 dashes for command line options. And I believe there is something in their implementation we still don't know. So the point is how important this case is and how frequently it will appear in a normal document. I can try to implement it, but I don't want to spend time implementing their bug. Could you give me a good explanation?
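One rule that reproduces the six observations above is: inside a whitespace-delimited token, a break occurs after a run of dashes whenever that run is followed by a non-dash character. The Python sketch below encodes exactly that. It is a guess fitted to these six data points, not Word's documented behavior; treating runs of three or more dashes like a double dash is an untested assumption, and the name count_words_dash_rule is hypothetical.

```python
import re

# A run of two or more dashes that is followed by a non-dash character.
# Assumption: runs longer than two behave like "--" (not verified in Word).
_DOUBLE_DASH_BREAK = re.compile(r"-{2,}(?=[^-])")


def count_words_dash_rule(text: str) -> int:
    """Whitespace-delimited count with an extra break after embedded '--'."""
    total = 0
    for token in text.split():
        # One word per token, plus one for each qualifying dash run.
        total += 1 + len(_DOUBLE_DASH_BREAK.findall(token))
    return total


if __name__ == "__main__":
    for s in ["This--is", "--help", "This-is", "-help", "----", "this--"]:
        print(repr(s), count_words_dash_rule(s))
```

Trailing dashes ("this--") and pure dash runs ("----") never trigger the break because nothing follows them, which is why they stay at one word.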
foobard->khong I'm a freelance journalist. About half of the time I get paid by the word. Or rather, I get paid by the word count as counted in Microsoft Word. Further, I work to fixed word counts. If my editor says, 20,000 words, and I submit a manuscript that's 20,005, it makes me look like an idiot. It would be like a programmer forgetting to include a matching curly brace -- it's so basic that if you can't do it right, you look very unprofessional. This issue affects all journalists and writers, for whom word count is the fundamental basis of their business. As I've written before, there are tens of thousands of us who would gladly switch to OOo if they could. Freelance journalists as a rule don't make a lot of money -- a lot less than programmers anyway -- so we are an ideal target audience to expand the OOo user base. I realize that you are attempting to reverse engineer in a black box situation. As a one-time programmer myself, I can appreciate the difficulty of such a task, and I thank you for making a start. But unless you finish the task, unless we make every effort to make OOo word count compatible with MS Word -- including any bugs that might exist -- then all the work you've put into this issue is of no use to me nor to any of the other writers out there. In "Writer Land", we live and die by word count. It is the measure of all things, and the final arbiter of all disagreements. It is law. And it is law as dictated by Microsoft. Please consider implementing this Microsoft bug.
OK, that is done; dashes are implemented MS style. The next difference (the exploration of the black box will never end) is for Writer. khong->fme: MS counts bullet/numbering as 1 word; we don't count it. To reproduce in MS Word: click on bullet, the word count is 0; type a word, and the word count is 2. But OOo counts 1. This must be implemented in Writer, which has to pass the bullet to the breakiterator.
fme->khong, foobard: This is harder than expected. Fields have to be expanded as well. For this I changed some more Writer code. Please verify. fme->QA: Please test word count thoroughly. Fields are now expanded before counting; numbering/bullets are also counted.
Ready for QA.
something for QA -- I note in issue 17964 in comment by sajer Sat Jan 31 the following: "It is no coincidence that Microsoft Word XP enhanced its word count feature, thats simply because people needed it and asked for more!" It is possible that Word XP's word count differs from previous versions. This might be worth double checking. also note comment by miller_dscott Thu Mar 25: "Many Asian languages are not counted in words, but in characters. In order for the word count tool to be useful with Asian languages, it needs to be able to distinguish between Asian and non-Asian characters and produce independent counts for Asian characters and non-Asian words. It's also important that the "Asian character count" not include Asian/double-byte spaces, or, that it show both Asian character total with spaces and Asian character total without spaces." erikanderson3 Thu May 6 08:59:33 +0000 2004 also chimes in with: "As another Japanese -> English translator, I would like to second miller_dscott's comments. A word/character count function that distinguishes between Asian and non-Asian text is vital. As mentioned in a post over on the OpenOffice.org Forum (http://www.oooforum.org/forum/viewtopic.php?p=23214#23214), I currently get rather silly results with OOo. A sample paragraph just pasted in from the front page of http://www.nikkei.com shows 135 Asian characters in Word, and 78 'words' in Writer. Again, the very concept of 'word' is irrelevant for counting Japanese, as everyone goes by character count, not including spaces. I understand it's similar for Chinese." As I don't use OOo with CJK support I can't comment on this. hth
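The separate counts requested in the quoted comments could look roughly like the Python sketch below. It is an illustration only: the Unicode ranges (kana, CJK Unified Ideographs, Hangul syllables) are a simplified assumption, CJK punctuation is not handled, and the names is_asian and count_mixed are hypothetical. Nothing here reflects how Word or Writer actually classify characters.

```python
# Simplified assumption: kana, CJK Unified Ideographs, Hangul syllables.
_ASIAN_RANGES = ((0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF))


def is_asian(ch: str) -> bool:
    """True if the character falls in one of the assumed Asian ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _ASIAN_RANGES)


def count_mixed(text: str) -> tuple[int, int]:
    """Return (Asian character count, non-Asian word count).

    Spaces, including the ideographic space U+3000, are outside the
    ranges above and are therefore never counted as Asian characters.
    """
    asian_chars = sum(1 for ch in text if is_asian(ch))
    # Blank out Asian characters, then count what is left by whitespace.
    western = "".join(" " if is_asian(ch) else ch for ch in text)
    return asian_chars, len(western.split())


if __name__ == "__main__":
    print(count_mixed("日本語 text here"))
```

Keeping the two numbers separate avoids reporting a meaningless "word" total for text that is normally measured in characters.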
We only fixed word count for Western languages in this issue. For CJK and CTL, we need to investigate native language versions of MS Word; unlike OOo, MS may have different implementations in native language versions.
Adjusted summary. Verified in CWS i18n40.
Has this fix been included in the most recent 3.0 dev snapshot? I just downloaded the latest .deb, installed it on Ubuntu, opened up a 21,660 word document that 3.0 tells me only contains 910 words.
update: the word count feature is absolutely spot on in the Windows version of OOo 3.0. It would appear the weirdness is only in the Linux .deb version. (I haven't tested on a Mac, don't have access to one.) thanks for fixing this problem!
EIS tells me: integrated in m6. If you have used this or a newer version, please attach a sample document. A difference of this size sounds strange.
->foobard I think you're seeing the bug described in issue 88484. Maybe that's a regression introduced by the changes for this issue.
I'm afraid I can't attach the document in question, as it's covered by an NDA. However, according to Properties->Statistics, the number of paragraphs is 667 and the number of words 682. I think troodon may be right.
I've discovered a little problem with bullet/numbering counting. See issue 88509.
Tested with four Word documents, and the word count DEV300 m9 gives (on Windows, so not affected by issue 88484) *never* matches the word count given by Word 2000. In my test OOo always counted more words than Word, the smallest delta being +8 and the biggest +162. Later I'll try with more files and I'll attach a Calc sheet with the results. Should I open another issue or should I post here? I can't publicly post the four files I tested with, but I could send them to a developer.
-> troodon if you like, send them to me. I'm testing with MS Office 2003, perhaps there are differences between how 2000 and 2003 count words. If nothing else, I can check and see if I get the same results you do.
Also, has this fix addressed bug 86537? I don't use footnotes and headers and footers so I'm not sure.
I've compiled a sheet with 17 publicly available files. From those 17 files, DEV300_m9 only gave the same word count as Word 2000 for two files (one txt file and one Word file), so I consider this issue isn't fixed.
Created attachment 53301 [details] Test results
Thanks for the test data. We will have a look on it. fme: please take over if time permits.
fme->troodon/foobard: Please open a new issue in case there are still some problems with the word count or differences to the MS word count. It would be very helpful if the two of you could provide us with a detailed description of what kind of words are actually causing the differences.
Working through troodon's spreadsheet, I had a look at the most egregious example, http://www.nursesreg.health.nsw.gov.au/assets/healthamms1/policies_procedures_sec1.doc. Looks like complex tables of contents created in MS Word just don't render in OOo, and since the words don't "exist", they can't be counted either. I've reported this as bug 89038. I tested in the latest beta but that wasn't an option in the dropdown menu.
also bug 89040. Looks like OOo is not counting MS Word created form fields (and maybe table content) correctly.
also appears to be a problem with text in MS Word-created text boxes, see bug 89041.
see also bug 89042: the test case (a ‘salvage function’) counts as three words in MS Word but four in the latest OOo beta.
SBA: The attached test_data.doc has 24 words in MS Word 2007 and 24 words in OOo DEV300_m24. Issue re-verified in master build and closed. Please comment on "other issues" in the respective issue. :-)