Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing |
Summary: | submit new en_US.dic without the errors | ||
---|---|---|---|
Product: | General | Reporter: | aardvark12 <dibble_d> |
Component: | spell checking | Assignee: | nemeth.lacko |
Status: | CLOSED NOT_AN_OOO_ISSUE | QA Contact: | issues@lingucomponent <issues> |
Severity: | Trivial | ||
Priority: | P3 | CC: | issues, kevina |
Version: | 3.3.0 or older (OOo) | ||
Target Milestone: | --- | ||
Hardware: | All | ||
OS: | All | ||
Issue Type: | ENHANCEMENT | Latest Confirmation in: | --- |
Developer Difficulty: | --- | ||
Attachments: |
Description
aardvark12
2008-08-01 17:08:09 UTC
Reassigned to lingucomponent.

David,

Thanks in advance for your great contribution. I have just started to make a new version for morphological analysis and generation, based on the old en_US dictionary and WordNet data. Kevin Atkinson is working on a maintained version of the OpenOffice.org en_US dictionary; see the result in recent Mozilla Firefox (also here: https://bugzilla.mozilla.org/show_bug.cgi?id=397150 and http://wordlist.sourceforge.net). Unfortunately, it contains the same errors:

$ grep '\(.\)\1\1' en_US.dic
AAA Andeee/M Annnora/M BBB Diannne/M Harwilll/M KKK/M Lilllie/M Minnnie/M Rafaellle/M SSS Sonnnie/M WWW/M iii viii ...

I would also like to examine corpus-based methods for improving the dictionary data. I will use this issue for the discussion about the planned dictionary improvements.

Best regards,
László

Target: 3.1

Created attachment 56328 [details]
revised en_US.dic
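The triple-letter check quoted earlier in this issue (grep '\(.\)\1\1' en_US.dic) can also be sketched in Python. This is a minimal illustration and not part of any attachment here: the function name and the sample entries are assumptions chosen for demonstration. Like the grep, it only flags candidates, so legitimate entries such as AAA or viii are reported alongside real typos.

```python
import re

# Same idea as the grep pattern: any character followed by two more copies of itself.
TRIPLE = re.compile(r"(.)\1\1")


def find_triple_letter_entries(lines):
    """Return the .dic entries that contain three identical characters in a row."""
    hits = []
    for line in lines:
        entry = line.strip()
        if entry and TRIPLE.search(entry):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Hypothetical sample; a real run would read the lines of en_US.dic instead.
    sample = ["AAA", "Andeee/M", "Annnora/M", "Annie/M", "viii"]
    for entry in find_triple_letter_entries(sample):
        print(entry)
```

Each hit still needs a human decision: the pattern cannot tell an abbreviation or Roman numeral from a misspelled name.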
Here is the integrated US English dictionary for Open Office. You will find many thousands of new words beyond the existing dictionary. All words were checked against the American Heritage Dictionary or http://dictionary.reference.com. In some cases, such as words that begin with the prefix "un," these sources failed me, and I instead used http://www.merriam-webster.com for a full list of words with the "un" prefix from an unabridged dictionary. I went through the word list and added possessives manually.

This dictionary is released under the GNU GPL version 3: en_US.dic by David M. Dibble, copyright September, 2008 (Standard terms apply--This is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.)

I compiled the dictionary using MUNCH under Puppy Linux. The regular en_US.dic has a number of lines with numerals at the beginning (about 20 lines). Those can be inserted into this dictionary. I wasn't quite sure what those lines meant.

Some quick explanations. Most dictionaries use common conventions. In a word entry, "OR" means that words have equal weight, as in "burned or burnt" (though the first listing may have a slight edge). In such cases both words are present in this dictionary. Dictionaries use "ALSO" to indicate a second-rate or inferior alternative, so in such cases the first listing should be used in a spell checker to encourage people to use the best choice. For instance, "papoose also pappoose." Microsoft Word uses "pappoose," but that word isn't even listed in the American Heritage Dictionary, and in the Random House Unabridged Dictionary the word "pappoose" is given as an "ALSO." So "papoose" is the best choice.

However, dictionaries can flat out disagree. Some list "facade" [unaccented] as the best. Some list "facade" [c cedilla] as best. In that case both words are in the spell checker. And words change as time passes. "Sea bird" has always been two words ("seawater" is one word), but I now think that "seabird" is acceptable.

Then there are problems of capitalization. My word list has "leno, leno's, slough, slough's." Jay Leno is a TV personality. Slough is a municipality in England. So maybe the word list should be "leno, Leno's, slough, Slough's." I just wasn't sure.

English is used internationally, as one sees on forums. So it seems odd to list every tiny town in the United States, but ignore the major metropolitan centers in the rest of the world. So I added many names for major cities, whether in Japan, or Brazil, or Pakistan. All these names should be correctly accented. And since I use Linux, I added names like Ubuntu, Xubuntu, Mandriva, AbiWord, Gnumeric, and so forth.

Many place names have accents. But people often use common names like Yucatan or Guantanamo or Galapagos without accents, and may not even be aware of the accented form. So I decided to include both the unaccented words and the accented words, though often the possessive form is only given for the accented (correct) word.

I removed words that could cause problems. I previously commented on "Lindberg" and "Lichtenstein." I also took out "corespondent," as many students will drop the R when they mean "correspondent," with humorous results. Besides, "corespondent" seems an outdated word; one very rarely hears it anymore. And I took out "nob" since students are sure to spell "knob" without the K. But if there is strong opinion that "corespondent" should be in the word list I would not object to seeing it put back in. A few months ago I used the word "stelar" in a review of a Tomb Raider custom level, but that word would just confuse people who want "stellar," so "stelar" isn't in the word list, either. In other words, a lot of judgment calls had to be made.
Also I took out most of the hardcore profanity and offensive racial epithets. People can still freely use these words all they want; the words just aren't in the dictionary.

David, please subscribe to wordlist-devel@lists.sourceforge.net: https://lists.sourceforge.net/lists/listinfo/wordlist-devel

We are working on the en_US dictionary for OpenOffice.org and Mozilla. Unfortunately, Mozilla has a stricter license policy and needs a GPL/LGPL/MPL tri-license; GPL 3 alone is not enough, and the same holds for dictionaries pre-bundled with OpenOffice.org. You have made a lot of nice improvements that we can integrate into the wordlist distribution or the generated Firefox/OpenOffice.org dictionaries under your name, and we can work together on a better and up-to-date American English spelling dictionary. But you can also make your own dictionary version for OpenOffice.org using the Extension support (http://extensions.services.openoffice.org/).

> The regular en_US.dic has a number of lines with numerals at the beginning
> (about 20 lines). Those can be inserted into this dictionary. I wasn't quite
> sure what those lines meant.

It is for ordinal number checking (1st, *11st etc.)

You requested that I subscribe to wordlist-devel@lists.sourceforge.net: https://lists.sourceforge.net/lists/listinfo/wordlist-devel I have subscribed, and gotten a confirmation e-mail.

I do not see a problem with other licenses beyond GPL 3. The dictionary represents years of hard work. My main concern was that I did not want a corporation to take my word list, encrypt it, and pass it off as their own spelling checker, sold for their profit. To that end I wanted to work in the open source community, such as with AbiWord, OpenOffice.org, and Mozilla.

I used MUNCH under Puppy Linux to compile the word list, in dictionary format, and submitted it as an attachment. If you would prefer to view the word list before it was compiled, it can be sent in zip format. Then the word choices are clearer.
I began to read through my submitted spelling dictionary and noticed a couple of omissions. The words "antiquark" and "antilepton" both lack a plural entry. This is easily solved by adding /S to their entry in the word list. I am prepared to spend 7-10 days going through the word list, looking for such omissions, but wasn't sure what the status of the word list is. Is this something you intend to use, and if so, how much proofreading and checking is being done by others? If most of it isn't being used, then there is no rush for me to do anything.

Second, I wanted to mention that the spelling checker can be enhanced by the Auto Correction feature in Open Office Writer. It is very important for a published writer not to make mistakes. I read the "Wasteland" series by Stephen King, and in the third book he uses "for awhile" five times on facing pages. This is the sort of thing that makes one sit up. I can remember nothing else that was on those two pages, but years later still remember those five errors. "Awhile" is an adverb, so it cannot be the object of a preposition. Also "awhile" means "for a time" so saying "for awhile" is equivalent to saying "for for a time." The correct usage is the noun form, which is two words, "a while," hence "for a while."

Such mistakes are easily caught using Open Office Auto Correction, entry and replacement:

"for awhile"      "for a while"
"after awhile"    "after a while"
"pointblank"      "point-blank"
"antisemitism"    "anti-Semitism"

This helps, since otherwise students may think that the omission of "pointblank" and "antisemitism" is a mistake. Note that Microsoft uses these two wrong entries, and WordNet usually goes along with Microsoft. This is typical Microsoft disregard of language, and professional writers do not endorse this. There are many other hyphenated words that can be included in the Auto Correction feature.
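The entry-and-replacement pairs listed above behave like a simple lookup table. The following Python sketch shows the idea only; it is not how Open Office implements Auto Correction, and the function and variable names are assumptions made for illustration. Matching is case-sensitive and restricted to whole words, so "awhile" on its own is left untouched.

```python
import re

# The four pairs given in the comment above.
AUTOCORRECT = {
    "for awhile": "for a while",
    "after awhile": "after a while",
    "pointblank": "point-blank",
    "antisemitism": "anti-Semitism",
}

# Try longer entries first so a multi-word entry is never shadowed by a shorter one.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(AUTOCORRECT, key=len, reverse=True))
    + r")\b"
)


def autocorrect(text):
    """Replace every listed entry in text with its preferred form."""
    return _PATTERN.sub(lambda match: AUTOCORRECT[match.group(1)], text)


if __name__ == "__main__":
    print(autocorrect("He waited for awhile, then fired at pointblank range."))
```

A table like this handles the fixed phrases named in the comment; it does not replace the spelling dictionary, which still decides whether each individual word is valid.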
I was often frustrated by not being able to include hyphenated words in the word list (though there are entries for "AK-47" and for "al-Qaeda"). Also I noticed that Hunspell does not catch very short accented words, such as "eclair" or "elan," which have acute accents over the E. The correctly accented word is in the dictionary, but Hunspell does not give it as a spelling suggestion. So use Auto Correction to make sure that such short accented words will be handled correctly. And have an entry for "deja vu" with all its accents.

As it is, the Auto Correction feature is wasted. It functions exactly as in Microsoft Word, catching a few misspelled words, and this is better left to the spelling dictionary. Instead, the Auto Correction feature could become quite useful.

UPGRADE DICTIONARY.

I have checked articles in the New York Times and the Wall Street Journal, and so forth, and am adding a number of new words, such as: Facebook, MySpace, Wikipedia, Geithner, cyberspy, etc. Am also adding a number of possessives to the dictionary, as there seems some confusion among writers about adjectives and nouns. Also spell checked some computer books. This will modernize the dictionary with thousands of additional words. I expect to release the update version in two weeks. It will be the same size as the original Open Office en_US.dic, though since I use real words, there will be a 50,000 word difference between my version and the dictionary packaged with Open Office.

Created attachment 61855 [details]
updated, enhanced en_US.dic
I have uploaded the enhanced dictionary, dd_2009_04_en_US.dic. It works in Open Office as en_US.dic. There is now a 63,000 word difference between this and the original Open Office en_US.dic. I received an e-mail from the original dictionary maintainer saying that he will block any effort to replace HIS dictionary. It is regrettable Open Office won't allow improvements. However, I have filed issue #101500 in order to handle hyphenated words (Hunspell does not work with hyphenated words). If you ever allow people to work on the dictionary let me know.

OOo 3.1 is released. Please check whether the issue still exists in OOo 3.1. If it does, please work on it to get it fixed in one of the next releases. Until then the issue gets the target 3.x.

Created attachment 75820 [details]
February 11, 2011 update of en_US.dic; 146,540 words
The word list was pruned of specialized or obscure words, particularly if those might interfere with finding more common words. As an example, 'chough' and 'scoter' are birds, but most people will be interested in typing 'cough' or 'scooter.' Sometimes choices aren't clear. 'Whicker' is a horse's whinny, but perhaps there is a conflict with 'wicker.' 'Whicker' was removed.

Often a dictionary will list plurals for words ending in 'o' as either -os or -oes, or words ending in 'a' as -as or -ae. If a dictionary separates the choices with 'or' then both plurals have equal weight, but a spellchecker may help a writer's consistency by only listing the first choice.

It has not escaped my attention that removing words helps to make room for later additions, as a number of new words and proper nouns need to be added to keep the word list current. (May 6, 2009 version, 150,240 words. Current version 146,540 words.)

The words 'shalt' and 'spake' are now in the list, but have been marked with an exclamation point for NO SUGGEST.

Hunspell is good at dividing long words into two, and checking each portion, useful for a Hungarian spellchecker. It is unable to handle hyphenated words. For this spellchecker to function properly, users need to install an autocorrect word list in Open Office, so that when 'paperclipped' is typed, 'paper-clipped' is automatically substituted. This is also true for some accented words, so that typing 'elan' produces 'élan.' Unfortunately, Open Office doesn't use the autocorrect feature this way.

Sources, listed in order of preference.
1) http://www.thefreedictionary.com/ American Heritage Dictionary, and Collins English Dictionary
2) http://dictionary.reference.com/ Random House Dictionary, Collins English Dictionary, Webster's Unabridged Dictionary
3) http://www.merriam-webster.com/ Merriam-Webster Dictionary
4) http://oxforddictionaries.com/?attempted=true The Oxford English Dictionary

Dictionaries often disagree on compound words, or on spelling. Generally the Random House Dictionary is very good, but it gives a spelling of 'mujahedin.' Going to an Arabic source to clarify matters only adds to the confusion, as that site gives seven possible spellings. Other dictionaries use the word 'mujahideen,' so that seems preferable.

Because of past problems with WordNet, I don't accept words with only this single source. WordNet gathers words from the web. This says nothing about the way people write, only that people are blindly reproducing the questionable Microsoft spellchecker, which has total dominance in the U.S.

February 11, 2011

Created attachment 75850 [details]
autocorrect hyphenates, compound words, grammar errors; word list
PROBLEM WORDS, submit autocorrect suggestions in plain text

A number of problem words were created by Microsoft's spelling mistakes. Because of Microsoft's total domination in the United States, these errors have been compounded a billion times in the past dozen years. Some words are entering the language. The surprising thing is that more words haven't been subverted, but that most intelligent people still continue to write "point-blank" instead of the ugly Microsoft "pointblank." Microsoft handled things by just removing essential hyphens. After all, no sense paying for programmers, or for anyone who knew even a smattering of the English language.

These problem words have to be checked every year. The Oxford English Dictionary (OED) monitors ten thousand transitional words on a daily basis. The OED now accepts the following nouns as one word, and they may be included in en_US.dic (they are missing from my version):

airbag, airbase, lifebuoy, waterhole - one word in the OED; two words in all other dictionaries

Note the following key words:

"all right" - the only correct usage. "alright" not acceptable
"point-blank" (Microsoft's word, "pointblank," not acceptable)

The preferred choice for "cafe" is now without an accented e. The OED has "razor blade." Everyone else uses "razorblade." The American Heritage Dictionary has "mockup," "mahjong," and "housepainter." OED and others use "mock-up." Almost everyone uses "mah-jongg" and "house painter." (May want to put "mah-jongg" in the autocorrect list.) The OED has "hot plate" and "hot pot," but Collins English Dictionary has these as one word.

Here are some words I really hate to look up every year. Real dictionaries list them as two words. This isn't a complete list, but these words are noted in the attached autocorrect suggestions file.

bean sprouts
black light
coal mine
con man, con men
drift net
drop kick
fire truck
floor show
fly swatter [one dic. has flyswatter as second-rate choice]
gun battle
hair dryer
ice pack
land mine
love child
milk shake
nose cone
school day
six-shooter [hyphenation is correct, as here]
staff room [British usage, a teachers' lounge, one word]
tea bag
tea leaves
trash can
water mill

Hunspell does not handle hyphenated words, but these can be substituted using the autocorrect feature of Open Office. Also some accented words aren't found by Hunspell, and they are in the list, as well as recommendations for various word substitutions. The fifteen-page plain text file is attached.

Created attachment 76178 [details]
March 23, 2011: use hyphenated words; detect accents; many new words
The revisions and updates are done to my Open Office U.S. English spellchecker. There is a 55,000-word difference to the official Open Office version of December 27, 2010.
WHAT WAS DONE
Hyphenated words may now be checked as a unit. This makes Hunspell a professional spellchecker. Common words like "has-beens," "have-nots," or "lean-tos" no longer display an error, as such words are in en_US.dic. Type "Toulouse-letreck" and the speller will recommend "Toulouse-Lautrec." To use the new hyphenated-words feature you need Open Office 3.2 or later (doesn't work on 3.1). Tested on 3.2.1.
Changed the affix file. Now Hunspell will recommend accented words such as "señor, cliché, touché, piñata, or garçon." These were in my word list, but Hunspell was unable to suggest them as replacements. Now Hunspell recommends "all right" for "alright." Further modified affix file for better dictionary compression, new prefix (PFX O) and suffix (SFX Q and W) entries. (First line of SFX entry contains N for NO; the man/woman SFX entries in en_AU.aff and en_NZ.aff are incorrect.)
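For readers unfamiliar with affix files: each SFX block in a Hunspell .aff file starts with a header line (the flag, a Y or N cross-product field, and a rule count), followed by rule lines that give the characters to strip, the suffix to add, and a condition on the word ending. The Python sketch below applies one such block. The rules follow the familiar English plural pattern and are an assumption for illustration; they are not copied from the attached affix file, and the new PFX O, SFX Q and SFX W entries are not reproduced here.

```python
import re

# Hypothetical plural rules as (strip, add, condition) tuples, mirroring the
# layout of SFX rule lines. "0" means "strip nothing". The conditions are
# mutually exclusive, so at most one rule matches any given word.
PLURAL_RULES = [
    ("y", "ies", "[^aeiou]y"),
    ("0", "s", "[aeiou]y"),
    ("0", "es", "[sxzh]"),
    ("0", "s", "[^sxzhy]"),
]


def apply_suffix(word, rules):
    """Return the form built by the first rule whose condition matches the end of word, or None."""
    for strip, add, condition in rules:
        if re.search("(?:" + condition + ")$", word):
            stem = word if strip == "0" else word[: len(word) - len(strip)]
            return stem + add
    return None


if __name__ == "__main__":
    for base in ["antiquark", "cyberbully", "day", "box"]:
        print(base, "->", apply_suffix(base, PLURAL_RULES))
```

This is why one flagged entry in the word list can stand for several surface forms, which is where the compression comes from.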
Made major revisions to word list. Added hundreds of new words like: cyberbullying, cyberbully, cybersecurity, bridezilla, overleveraged (from subprime mortgage crash), BFF, microblog, microblogging, steampunk. New words are approved by Oxford English Dictionary. Added a large number of hyphenated words.
David M. Dibble
Created attachment 76414 [details]
April 19, 2011: affix file for better compression; expand hyphenated entries
Changed affix file to give much better compression. On a reasonably sized word list, the dictionary file will be 70-80k smaller than before. Due to the new affix file, the current en_US.dic is only 568k.
In order to improve hyphenation entries, went through entire word list. While doing this, removed variant words and plurals for clarity. Many hyphenated words can be formed from individual words, and need not be included in the spellchecker. The most helpful entries should now be in en_US.dic.
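The decision described here, that a hyphenated word needs its own entry only when it cannot be built from words already in the list, can be sketched as follows. This is an illustrative Python model under stated assumptions, not a description of Hunspell's internals: the function name and the tiny word set are hypothetical.

```python
def is_accepted(word, known_words):
    """Accept word if it is listed as a unit, or if every hyphen-separated part is listed."""
    if word in known_words:
        return True
    parts = word.split("-")
    # A word without a hyphen has a single part and was already rejected above.
    return len(parts) > 1 and all(part in known_words for part in parts)


if __name__ == "__main__":
    # Hypothetical word set: "has-beens" must be listed as a unit because
    # "beens" is not a word, while "paper-clipped" can be formed from its parts.
    known = {"has-beens", "paper", "clipped"}
    for candidate in ["has-beens", "paper-clipped", "paper-cliped", "beens"]:
        print(candidate, is_accepted(candidate, known))
```

Under this model, only compounds with a part that is not itself a word, such as "has-beens," earn an explicit entry.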
Hello;

Please note that we no longer accept contributions under copyleft licenses. When you can, we recommend Apache License 2.0 to be consistent with the rest of the suite. http://www.apache.org/licenses/LICENSE-2.0 This license is considered GPL3 compatible by the FSF and will permit wider distribution of your work.

Dictionaries were removed as part of the Apache IP Clearance process.