Apache OpenOffice (AOO) Bugzilla – Issue 92383
submit new en_US.dic without the errors
Last modified: 2017-05-20 10:03:18 UTC
I have completed my own en_US.dic for spell checking in Open Office. This was essential. There are a large number of errors in the existing Open Office en_US.dic spelling dictionary. This is not surprising; it seems that someone used Microsoft Word to check online word lists, such as the 110,000 word list that is commonly available, and then included the approved results in the Open Office spelling dictionary. The problem is that that 110,000 word list, which is advertised as suitable for spell checking, contains about 8,000 errors. Many of those errors ended up in Microsoft Word itself. Anyone using that word list, or relying on Microsoft Word to produce error-free dictionaries, is going to end up with a spell checker that is riddled with errors. As an example, when I look at Microsoft Word or Open Office (the errors are the same, except that Open Office has more of them), I see things like airbag [air bag], airbase [air base], pointblank [point-blank], teabag [tea bag], tealeaves [tea leaves], sanserif [sans serif], Roobbie [?], slowcoaches [?], antisemitic, antisemitism [anti-Semitic, anti-Semitism], rightsize, eageyness, or Rafaellle. Very few English words have 3 L's, yet if you use a simple search on en_US.dic, you will find other words besides Rafaellle with 3 L's. There is little sense trying to list every problem. I also have serious difficulties with the word choices (again, all these problems stem from MS Word). The famous aviator is Lindbergh, so why put the name Lindberg in a spell checker and create problems for students? The name of the country is Liechtenstein, so why put Lichtenstein in a spell checker? (There is a Roy Lichtenstein but, apologies to Roy, nobody cares. Most people want the name of the country.) I thought that to remove the garbage from en_US.dic I might have to take out 3,000 or 4,000 words, but the actual number was much higher. I have seen published novels that relied on Microsoft Word. 
They may contain a dozen or more misspelled words. Every error jerks a reader out of the illusion created by the writing, and causes the reader to question the writer's credibility. After about six errors many readers consider discarding a book. Professionally produced books should not contain any errors--not one. I have two novels on Amazon.com. I created my own spelling checker in 1993, and every few years I revised the word list, so I have been at this for a while. In 2001, I released WORDFUN2.ZIP (118,000 word list), which is on simtel.net. That spell checker contained many words suitable for Scrabble play. I used a different spell checker for producing books. My spelling checker was updated in 2003 and in 2006, and in May-July 2008 I did a complete check of the word list against published dictionaries, and integrated words from Open Office en_US.dic. I typically use http://dictionary.reference.com. Entered words are checked against the American Heritage Dictionary (my favorite), and the Random House Unabridged Dictionary (very good), and a Webster's Unabridged (can be questionable as it is so inclusive) and against WordNet (not to be trusted, as some of their choices are flat out wrong). My dictionary is very close in size to the existing en_US.dic used in Open Office. I would like to offer it as an alternative for writers or business professionals who don't want to look like idiots. (My dictionary doesn't contain "alright." Nonprofessionals think it's just fine to use "alright," and will start screaming and spitting at you that "alright" is a word. Every writer I know thinks the usage of that word is a sure indication of illiteracy. The correct usage is "all right.") I realize that most people won't care. I used to complain that the online 110,000 English word list, recommended for use in spell checkers, contained 8,000 misspellings. Nobody cared. But professionals do need an accurate word list. The word list is available.
I suppose it should be released under some sort of GNU license. I could not get MUNCH or UNMUNCH for Hunspell (the people maintaining the dictionary seem to regard it as proprietary), but I did find the program MySpell, and was able to compile MUNCH and UNMUNCH using Puppy Linux, then used MySpell MUNCH to compile a dictionary from my word list, then transferred that dictionary to Windows and used it with Hunspell. I have also replaced the existing en_US.dic in Open Office with my own version and have been testing it out. It seems to work fine. Some work needs to be done on the possessive forms (apostrophe-S). I never used this with my own spelling checker, but instead parsed the root word. The same is true of WordPerfect: it looks at the root word, and drops the apostrophe-S. The reason for this is that ready-made possessive forms can never be accurate. English is loaded with words such as gerunds, which serve both as nouns and verbs (singing, stuffing, etc.). And there are plenty of words that function both as a noun and as an adjective. So making a sometimes noun possessive doesn't keep people from misusing it. Most nouns take an apostrophe-S, even if they end in S: Charles's tonsils, Jones's leg. But this rule doesn't apply to many ancient or historical words, so: Moses', Isis', Achilles'. The rule says I should write "Kansas's wheat fields." But if I write "Kansas's streams," then there is too much sibilance, so the astute editor will change it to "Kansas' streams." So neither WordPerfect nor I have ever tried to codify the use of possessives, since the knowledgeable writer knows it can't be done. Apostrophes are used for living things, personifications, or words of space, time, and weight. Also for common phrases like: heart's delight, stone's throw, and water's edge. Note that "chair's leg" does not fit these criteria.
However the phrase "he fell back into the chair's embrace" seems to pass because chairs don't embrace, so this might be considered a personification. Most proper nouns such as Titanic or London can be used as personifications, so the names of cities, states, countries, rivers, and ships can easily take a possessive. Even words like Chemistry can take a possessive form: Department of Chemistry's examines. From this it is clear that many of the possessives that occur in the current en_US.dic fail to conform to grammatical rules. So a complete dictionary with possessives will probably take me a few weeks more, and even then the possessives will be questionable, in much the same way as the usage in Microsoft Word. David Dibble dibble_d@sbcglobal.net
Reassigned to lingucomponent.
David, Thanks in advance for your great contribution. I just started to make a new version for morphological analysis and generation based on the old en_US dictionary and WordNet data. There is an effort from Kevin Atkinson to make a maintained version from the OpenOffice.org en_US dic, see the result in the recent Mozilla Firefox (also here: https://bugzilla.mozilla.org/show_bug.cgi?id=397150 and http://wordlist.sourceforge.net). Unfortunately, it contains the same errors:

$ grep '\(.\)\1\1' en_US.dic
AAA
Andeee/M
Annnora/M
BBB
Diannne/M
Harwilll/M
KKK/M
Lilllie/M
Minnnie/M
Rafaellle/M
SSS
Sonnnie/M
WWW/M
iii
viii
...

I would also like to examine corpus-based methods to improve the dictionary data. I will use this issue for the discussion about the planned dictionary improvements. Best regards, László
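For anyone without grep at hand, the same check can be sketched in Python. This is only an illustration: the function name and the sample entries are made up for the example, and unlike the grep command it looks at the word part of each entry only, ignoring the affix flags after the slash.

```python
import re

# Same idea as the grep pattern: any character followed by two more copies of itself.
TRIPLE = re.compile(r"(.)\1\1")

def triple_letter_entries(lines):
    """Return .dic entries whose word part has three identical characters in a row."""
    hits = []
    for line in lines:
        entry = line.strip()
        # Skip blank lines and the numeric word-count line at the top of a .dic file.
        if not entry or entry.isdigit():
            continue
        word = entry.split("/", 1)[0]  # drop affix flags such as /M
        if TRIPLE.search(word):
            hits.append(entry)
    return hits

# Hypothetical sample; with a real file, pass open("en_US.dic", encoding="utf-8").
sample = ["5", "Rafaellle/M", "Raphael/M", "viii", "scooter/MS", "WWW/M"]
print(triple_letter_entries(sample))
```

Legitimate entries such as Roman numerals and abbreviations (viii, WWW) are reported too, so the output still needs a human pass.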
Target: 3.1
Created attachment 56328 [details] revised en_US.dic
Here is the integrated US English dictionary for Open Office. You will find many thousands of new words beyond the existing dictionary. All words were checked against the American Heritage Dictionary or http://dictionary.reference.com. In some cases, such as words that begin with the prefix "un," these sources failed me, and I instead used http://www.merriam-webster.com for a full list of words with the "un" prefix from an unabridged dictionary. I went through the word list and added possessives manually. This dictionary is released under the Gnu GPL version 3: en_US.dic by David M. Dibble, copyright September, 2008 (Standard terms apply--This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.) I compiled the dictionary using MUNCH under Puppy Linux. The regular en_US.dic has a number of lines with numerals at the beginning (about 20 lines). Those can be inserted into this dictionary. I wasn't quite sure what those lines meant. Some quick explanations. Most dictionaries use common conventions. In a word entry, "OR" means that words have equal weight, as in "burned or burnt" (though the first listing may have a slight edge). In such cases both words are present in this dictionary. Dictionaries use "ALSO" to indicate a second-rate or inferior alternative, so in such cases the first listing should be used in a spell checker to encourage people to use the best choice. For instance, "papoose also pappoose." Microsoft Word uses "pappoose," but that word isn't even listed in the American Heritage Dictionary, and in the Random House Unabridged Dictionary the word "pappoose" is given as an "ALSO." So "papoose" is the best choice. However, dictionaries can flat out disagree. Some list "facade" [unaccented] as the best. Some list "facade" [c cedilla] as best. 
In that case both words are in the spell checker. And words change as time passes. "Sea bird" has always been two words, ("seawater" is one word), but I now think that "seabird" is acceptable. Then there are problems of capitalization. My word list has "leno, leno's, slough, slough's." Jay Leno is a TV personality. Slough is a municipality in England. So maybe the word list should be "leno, Leno's, slough, Slough's." I just wasn't sure. English is used internationally, as one sees on forums. So it seems odd to list every tiny town in the United States, but ignore the major metropolitan centers in the rest of the world. So I added many names for major cities, whether in Japan, or Brazil, or Pakistan. All these names should be correctly accented. And since I use Linux, I added names like Ubuntu, Xubuntu, Mandriva, AbiWord, Gnumeric, and so forth. Many place names have accents. But people often use common names like Yucatan or Guantanamo or Galapagos without accents, and may not even be aware of the accented form. So I decided to include both the unaccented words and the accented words, though often the possessive form is only given for the accented (correct) word. I removed words that could cause problems. I previously commented on "Lindberg" and "Lichtenstein." I also took out "corespondent," as many students will drop the R when they mean "correspondent," with humorous results. Besides, "corespondent" seems an outdated word; one very rarely hears it anymore. And I took out "nob" since students are sure to spell "knob" without the K. But if there is strong opinion that "corespondent" should be in the word list I would not object to seeing it put back in. A few months ago I used the word "stelar" in a review of a Tomb Raider custom level, but that word would just confuse people who want "stellar," so "stelar" isn't in the word list, either. In other words, a lot of judgment calls had to be made. 
Also I took out most of the hardcore profanity and offensive racial epithets. People can still freely use these words all they want; the words just aren't in the dictionary.
David, please subscribe to wordlist-devel@lists.sourceforge.net: https://lists.sourceforge.net/lists/listinfo/wordlist-devel We are working on the en_US dictionary for OpenOffice.org and Mozilla. Unfortunately, Mozilla has a stricter license policy and needs a GPL/LGPL/MPL tri-license; GPL 3 alone is not enough, and the same applies to the dictionaries pre-bundled with OpenOffice.org. You have made a lot of nice developments that we can integrate into the wordlist distribution or the generated Firefox/OpenOffice.org dictionaries under your name, and we can work together on a better and up-to-date American English spelling dictionary. But you can also make your own dictionary version for OpenOffice.org using the Extension support (http://extensions.services.openoffice.org/).

> The regular en_US.dic
> has a number of lines with numerals at the beginning (about 20 lines). Those can
> be inserted into this dictionary. I wasn't quite sure what those lines meant.

It is for ordinal number checking (1st, *11st etc.)
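To make the "1st, *11st" example concrete, here is a small Python sketch of the English ordinal rule those lines are there to enforce. It is not how Hunspell implements the check (Hunspell works from the numeral entries and the affix file); the function names are invented for the illustration.

```python
import re

# A run of digits followed by one of the four English ordinal suffixes.
ORDINAL = re.compile(r"(\d+)(st|nd|rd|th)")

def expected_suffix(number):
    """Return the correct ordinal suffix for a non-negative integer."""
    # 11, 12 and 13 (also 111, 212, ...) always take "th".
    if number % 100 in (11, 12, 13):
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(number % 10, "th")

def is_valid_ordinal(token):
    """True for tokens like '1st' or '11th', False for '11st' or non-ordinals."""
    match = ORDINAL.fullmatch(token)
    if match is None:
        return False
    return match.group(2) == expected_suffix(int(match.group(1)))

for token in ["1st", "11st", "11th", "22nd", "103rd"]:
    print(token, is_valid_ordinal(token))
```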
You requested that I subscribe to wordlist-devel@lists.sourceforge.net: https://lists.sourceforge.net/lists/listinfo/wordlist-devel I have subscribed, and gotten a confirmation e-mail. I do not see a problem with other licenses beyond GPL 3. The dictionary represents hard years of work. My main concern was that I did not want a corporation to take my word list, encrypt it, and pass it off as their own spelling checker, sold for their profit. To that end I wanted to work in the open source community, such as with AbiWord, OpenOffice.org, and Mozilla. I used MUNCH under Puppy Linux to compile the word list, in dictionary format, and submitted it as an attachment. If you would prefer to view the word list before it was compiled, it can be sent in zip format. Then the word choices are clearer.
I began to read through my submitted spelling dictionary and noticed a couple of omissions. The words "antiquark" and "antilepton" both lack a plural entry. This is easily solved by adding /S to their entry in the word list. I am prepared to spend 7-10 days going through the word list, looking for such omissions, but wasn't sure what the status of the word list is. Is this something you intend to use, and if so, how much proofreading and checking is being done by others? If most of it isn't being used, then there is no rush for me to do anything. Second, I wanted to mention that the spelling checker can be enhanced by the Auto Correction feature in Open Office writer. It is very important for a published writer not to make mistakes. I read the "Wasteland" series by Stephen King, and in the third book he uses "for awhile" five times on facing pages. This is the sort of thing that makes one sit up. I can remember nothing else that was on those two pages, but years later still remember those five errors. "Awhile" is an adverb, so it cannot be the object of a preposition. Also "awhile" means "for a time" so saying "for awhile" is equivalent to saying "for for a time." The correct usage is the noun form, which is two words "a while," hence "for a while." Such mistakes are easily caught using Open Office Auto Correction, entry and replacement: "for awhile" "for a while" "after awhile" "after a while" "pointblank" "point-blank" "antisemitism" "anti-Semitism" This helps, since otherwise students may think that the omission of "pointblank" and "antisemitism" is a mistake. Note that Microsoft uses these two wrong entries, and Word Net usually goes along with Microsoft. This is typical Microsoft disregard of language, and professional writers do not endorse this. There are many other hyphenated words that can be included in the Auto Correction feature. 
I was often frustrated by not being able to include hyphenated words in the word list (though there are entries for "AK-47" and for "al-Qaeda"). Also I noticed that Hunspell does not catch very short accented words, such as "eclair" or "elan," which have acute accents over the E. The correctly accented word is in the dictionary, but Hunspell does not give it as a spelling suggestion. So use Auto Correction to make sure that such short accented words will be handled correctly. And have an entry for "deja vu" with all its accents. As it is, the Auto Correction feature is wasted. It functions exactly as in Microsoft Word, catching a few misspelled words, and this is better left to the spelling dictionary. Instead, the Auto Correction feature could become quite useful.
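The entry-and-replacement pairs listed above amount to a simple lookup table. A rough Python sketch of the idea follows, using only the four pairs quoted in this comment; it is a stand-alone illustration and says nothing about how Open Office stores or applies its own Auto Correction tables.

```python
import re

# Entry -> replacement pairs quoted in the comment above.
REPLACEMENTS = {
    "for awhile": "for a while",
    "after awhile": "after a while",
    "pointblank": "point-blank",
    "antisemitism": "anti-Semitism",
}

def autocorrect(text, table=REPLACEMENTS):
    """Replace whole-word occurrences of each entry (case-sensitive)."""
    # Longest entries first, so a multi-word entry wins over any shorter overlap.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)

print(autocorrect("He waited for awhile, then fired at pointblank range."))
```

Note that "awhile" on its own is left alone, since the adverb is correct in a sentence such as "Stay awhile."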
UPGRADE DICTIONARY. I have checked articles in the New York Times and the Wall Street Journal, and so forth, and am adding a number of new words, such as: Facebook, MySpace, Wikipedia, Geithner, cyberspy, etc. Am also adding a number of possessives to the dictionary, as there seems some confusion among writers about adjectives and nouns. Also spell checked some computer books. This will modernize the dictionary with thousands of additional words. I expect to release the update version in two weeks. It will be the same size as the original Open Office en_US.dic, though since I use real words, there will be a 50,000 word difference between my version and the dictionary packaged with Open Office.
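The comment does not say how the word difference was counted. One plausible way to get a comparable figure, assuming the comparison is made on the stems listed in each .dic file rather than on the fully expanded (unmunched) word forms, is sketched below in Python. The function names and sample entries are invented for the example.

```python
def load_stems(lines):
    """Collect the stem of every entry, ignoring affix flags and the count line."""
    stems = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.isdigit():
            continue
        stems.add(entry.split("/", 1)[0])
    return stems

def stem_difference(old_lines, new_lines):
    """Return (stems only in the old list, stems only in the new list)."""
    old, new = load_stems(old_lines), load_stems(new_lines)
    return sorted(old - new), sorted(new - old)

# Hypothetical miniature dictionaries standing in for the two en_US.dic files.
old_dic = ["3", "Lindberg/M", "airbag/S", "cough/S"]
new_dic = ["3", "Lindbergh/M", "cough/SD", "Facebook/M"]
removed, added = stem_difference(old_dic, new_dic)
print(len(removed) + len(added), removed, added)
```

Counting stems understates the difference in expanded forms, since one stem with several flags can stand for many words.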
Created attachment 61855 [details] updated, enhanced en_US.dic
I have uploaded the enhanced dictionary, dd_2009_04_en_US.dic. It works in Open Office as en_US.dic. There is now a 63,000 word difference between this and the original Open Office en_US.dic. I received an e-mail from the original dictionary maintainer saying that he will block any effort to replace HIS dictionary. It is regrettable Open Office won't allow improvements. However, I have filed issue #101500 in order to handle hyphenated words (Hunspell does not work with hyphenated words). If you ever allow people to work on the dictionary let me know.
OOo 3.1 is released. Please check whether the issue still exists in OOo 3.1. If yes, please work on it to get it fixed in one of the next releases. Until then the issue gets the target 3.x.
Created attachment 75820 [details] February 11, 2011 update of en_US.dic; 146,540 words
The word list was pruned of specialized or obscure words, particularly if those might interfere with finding more common words. As example, 'chough' and 'scoter' are birds, but most people will be interested in typing 'cough' or 'scooter.' Sometimes choices aren't clear. 'Whicker' is a horse's whinny, but perhaps there is a conflict with 'wicker.' 'Whicker' was removed. Often a dictionary will list plurals for words ending in 'o' as either -os or -oes, or words ending in 'a' as -as or -ae. If a dictionary separates the choices with 'or' then both plurals have equal weight, but a spellchecker may help a writer's consistency by only listing the first choice. It has not escaped my attention that removing words helps to make room for later additions, as a number of new words and proper nouns need to be added to keep the word list current. (May 6, 2009 version, 150,240 words. Current version 146,540 words.) The words 'shalt' and 'spake' are now in the list, but have been marked with an exclamation point for NO SUGGEST. Hunspell is good at dividing long words into two, and checking each portion, useful for a Hungarian spellchecker. It is unable to handle hyphenated words. For this spellchecker to function properly, users need to install an autocorrect word list in Open Office, so that when 'paperclipped' is typed, 'paper-clipped' is automatically substituted. This is also true for some accented words, so that typing 'elan' produces 'élan.' Unfortunately, Open Office doesn't use the autocorrect feature this way. Sources, listed in order of preference. 1) http://www.thefreedictionary.com/ American Heritage Dictionary, and Collins English Dictionary 2) http://dictionary.reference.com/ Random House Dictionary, Collins English Dictionary, Webster's Unabridged Dictionary 3) http://www.merriam-webster.com/ Merriam-Webster Dictionary 4) http://oxforddictionaries.com/?attempted=true The Oxford English Dictionary Dictionaries often disagree on compound words, or on spelling. 
Generally the Random House Dictionary is very good, but it gives a spelling of 'mujahedin.' Going to an Arabic source to clarify matters only adds to the confusion, as that site gives seven possible spellings. Other dictionaries use the word 'mujahideen,' so that seems preferable. Because of past problems with WordNet, I don't accept words with only this single source. WordNet gathers words from the web. This says nothing about the way people write, only that people are blindly reproducing the questionable Microsoft spellchecker, which has total dominance in the U.S. February 11, 2011
Created attachment 75850 [details] autocorrect hyphenates, compound words, grammar errors; word list
PROBLEM WORDS, submit autocorrect suggestions in plain text

A number of problem words were created by Microsoft's spelling mistakes. Because of Microsoft's total domination in the United States, these errors have been compounded a billion times in the past dozen years. Some words are entering the language. The surprising thing is that more words haven't been subverted, but that most intelligent people still continue to write "point-blank" instead of the ugly Microsoft "pointblank." Microsoft handled things by just removing essential hyphens. After all, no sense paying for programmers, or for anyone who knew even a smattering of the English language. These problem words have to be checked every year. The Oxford English Dictionary (OED) monitors ten thousand transitional words on a daily basis. The OED now accepts the following nouns as one word, and they may be included in en_US.dic (they are missing from my version):

airbag airbase lifebuoy waterhole - one word OED, all other dictionaries, two words

Note the following key words:

"all right" - the only correct usage. "alright" not acceptable
"point-blank" (Microsoft's word, "pointblank," not acceptable)

The preferred choice for "cafe" is now without an accented e. The OED has "razor blade." Everyone else uses "razorblade." The American Heritage Dictionary has "mockup," "mahjong," and "housepainter." OED and others use "mock-up." Almost everyone uses "mah-jongg" and "house painter." (May want to put "mah-jongg" in the autocorrect list.) The OED has "hot plate" and "hot pot," but Collins English Dictionary has these as one word.

Here are some words I really hate to look up every year. Real dictionaries list them as two words. This isn't a complete list, but these words are noted in the attached autocorrect suggestions file.

bean sprouts
black light
coal mine
con man, con men
drift net
drop kick
fire truck
floor show
fly swatter [one dic. has flyswatter as second-rate choice]
gun battle
hair dryer
ice pack
land mine
love child
milk shake
nose cone
school day
six-shooter [hyphenation is correct, as here]
staff room [British usage, a teachers' lounge, one word]
tea bag
tea leaves
trash can
water mill

Hunspell does not handle hyphenated words, but these can be substituted using the autocorrect feature of Open Office. Also some accented words aren't found by Hunspell, and they are in the list, as well as recommendations for various word substitutions. The fifteen-page plain text file is attached.
Created attachment 76178 [details] March 23, 2011: use hyphenated words; detect accents; many new words The revisions and updates are done to my Open Office U.S. English spellchecker. There is a 55,000-word difference to the official Open Office version of December 27, 2010. WHAT WAS DONE Hyphenated words may now be checked as a unit. This makes Hunspell a professional spellchecker. Common words like "has-beens," "have-nots," or "lean-tos" no longer display an error, as such words are in en_US.dic. Type "Toulouse-letreck" and the speller will recommend "Toulouse-Lautrec." To use the new hyphenated-words feature you need Open Office 3.2 or later (doesn't work on 3.1). Tested on 3.2.1. Changed the affix file. Now Hunspell will recommend accented words such as "señor, cliché, touché, piñata, or garçon." These were in my word list, but Hunspell was unable to suggest them as replacements. Now Hunspell recommends "all right" for "alright." Further modified affix file for better dictionary compression, new prefix (PFX O) and suffix (SFX Q and W) entries. (First line of SFX entry contains N for NO; the man/woman SFX entries in en_AU.aff and en_NZ.aff are incorrect.) Made major revisions to word list. Added hundreds of new words like: cyberbullying, cyberbully, cybersecurity, bridezilla, overleveraged (from subprime mortgage crash), BFF, microblog, microblogging, steampunk. New words are approved by Oxford English Dictionary. Added a large number of hyphenated words. David M. Dibble
Created attachment 76414 [details] April 19, 2011: affix file for better compression; expand hyphenated entries Changed affix file to give much better compression. On a reasonable sized word list, the dictionary file will be 70-80k smaller than before. Due to new affix file, current en_US.dic is only 568k. In order to improve hyphenation entries, went through entire word list. While doing this, removed variant words and plurals for clarity. Many hyphenated words can be formed from individual words, and need not be included in the spellchecker. The most helpful entries should now be in en_US.dic.
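As background on why the affix file affects dictionary size: each line in en_US.dic is a stem plus flags, and each flag names a rule block in the .aff file that generates further forms, so one line can stand for several words. The Python sketch below illustrates this with a plural flag. The rule table is an assumption modeled on the general shape of a Hunspell suffix rule (strip, add, condition); it is not copied from the attached affix file, and the function name is made up.

```python
import re

# Hypothetical plural rules in the style of a Hunspell "SFX S" block:
# (characters to strip from the stem, characters to add, condition on the stem ending)
PLURAL_RULES = [
    ("y", "ies", r"[^aeiou]y$"),  # party -> parties
    ("",  "s",   r"[aeiou]y$"),   # day   -> days
    ("",  "es",  r"[sxzh]$"),     # box   -> boxes
    ("",  "s",   r"[^sxzhy]$"),   # quark -> quarks
]

def expand(entry):
    """Expand one .dic entry into the word forms it stands for (flag S only)."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    if "S" in flags:
        for strip, add, condition in PLURAL_RULES:
            if re.search(condition, stem):
                base = stem[: len(stem) - len(strip)] if strip else stem
                forms.append(base + add)
    return forms

for entry in ["antiquark/S", "party/S", "box/S", "antilepton"]:
    print(entry, "->", expand(entry))
```

This also shows the fix mentioned earlier in the thread: an entry such as "antilepton" with no flag yields only the singular, and adding /S supplies the plural without a second line in the file.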
Hello; Please note that we will not receive contributions under copyleft licenses anymore. When you can, we recommend Apache License 2.0 to be consistent with the rest of the suite. http://www.apache.org/licenses/LICENSE-2.0 This license is considered GPL3 compatible by the FSF and will permit wider distribution of your work.
Dictionaries were removed as part of Apache IP Clearance process.