Issue 92383

Summary: submit new en_US.dic without the errors
Product: General Reporter: aardvark12 <dibble_d>
Component: spell checking Assignee: nemeth.lacko
Status: CLOSED NOT_AN_OOO_ISSUE QA Contact: issues@lingucomponent <issues>
Severity: Trivial    
Priority: P3 CC: issues, kevina
Version: 3.3.0 or older (OOo)   
Target Milestone: ---   
Hardware: All   
OS: All   
Issue Type: ENHANCEMENT Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description Flags
revised en_US.dic
none
updated, enhanced en_US.dic
none
February 11, 2011 update of en_US.dic; 146,540 words
none
autocorrect hyphenates, compound words, grammar errors; word list
none
March 23, 2011: use hyphenated words; detect accents; many new words
none
April 19, 2011: affix file for better compression; expand hyphenated entries
none

Description aardvark12 2008-08-01 17:08:09 UTC
I have completed my own en_US.dic for spell checking in Open Office. This was
essential.

There are a large number of errors in the existing Open Office en_US.dic
spelling dictionary. This is not surprising; it seems that someone used
Microsoft Word to check online word lists, such as the 110,000 word list that is
commonly available, and then included the approved results in the Open Office
spelling dictionary. The problem is that that 110,000 word list, which is
advertised as suitable for spell checking, contains about 8,000 errors. Many of
those errors ended up in Microsoft Word itself. Anyone using that word list, or
relying on Microsoft Word to produce error-free dictionaries, is going to end up
with a spell checker that is riddled with errors.

As an example, when I look at Microsoft Word or Open Office (the errors are the
same, except that Open Office has more of them), I see things like airbag [air
bag], airbase [air base], pointblank [point-blank], teabag [tea bag], tealeaves
[tea leaves], sanserif [sans serif], Roobbie [?], slowcoaches [?], antisemitic,
antisemitism [anti-Semitic, anti-Semitism], rightsize, eageyness, or Rafaellle.
Very few English words have 3 L's, yet if you use a simple search on en_US.dic,
you will find other words besides Rafaellle with 3 L's. There is little sense
trying to list every problem. I also have serious difficulties with the word
choices (again, all these problems stem from MS Word). The famous aviator is
Lindbergh, so why put the name Lindberg in a spell checker and create problems
for students? The name of the country is Liechtenstein, so why put Lichtenstein
in a spell checker? (There is a Roy Lichtenstein but, apologies to Roy, nobody
cares. Most people want the name of the country.) I thought that to remove the
garbage from en_US.dic I might have to take out 3,000 or 4,000 words, but the
actual number was much higher.

I have seen published novels that relied on Microsoft Word. They may contain a
dozen or more misspelled words. Every error jerks a reader out of the illusion
created by the writing, and causes the reader to question the writer's
credibility. After about six errors many readers consider discarding a book.
Professionally produced books should not contain any errors--not one.

I have two novels on Amazon.com. I created my own spelling checker in 1993, and
every few years I revised the word list, so I have been at this for a while. In
2001, I released WORDFUN2.ZIP (118,000 word list), which is on simtel.net. That
spell checker contained many words suitable for Scrabble play. I used a
different spell checker for producing books. My spelling checker was updated in
2003 and in 2006, and in May-July 2008 I did a complete check of the word list
against published dictionaries, and integrated words from Open Office en_US.dic.

I typically use http://dictionary.reference.com. Entered words are checked
against the American Heritage Dictionary (my favorite), and the Random House
Unabridged Dictionary (very good), and a Webster's Unabridged (can be
questionable as it is so inclusive) and against WordNet (not to be trusted, as
some of their choices are flat out wrong).

My dictionary is very close in size to the existing en_US.dic used in Open
Office. I would like to offer it as an alternative for writers or business
professionals who don't want to look like idiots. (My dictionary doesn't contain
"alright." Nonprofessionals think it's just fine to use "alright," and will
start screaming and spitting at you that "alright" is a word.  Every writer I
know thinks the usage of that word is a sure indication of illiteracy. The
correct usage is "all right.")

I realize that most people won't care. I used to complain that the online
110,000 English word list, recommended for use in spell checkers, contained
8,000 misspellings. Nobody cared. But professionals do need an accurate word list.

The word list is available. I suppose it should be released under some sort of
GNU license. I could not get MUNCH or UNMUNCH for Hunspell (the people
maintaining the dictionary seem to regard it as proprietary), but I did find the
program MySpell, and was able to compile MUNCH and UNMUNCH using Puppy Linux,
then used MySpell MUNCH to compile a dictionary from my word list, then
transferred that dictionary to Windows and used it with Hunspell. I have also
replaced the existing en_US.dic in Open Office with my own version and have been
testing it out. It seems to work fine.

Some work needs to be done on the possessive forms (apostrophe-S). I never used
this with my own spelling checker, but instead parsed the root word. The same is
true of WordPerfect: it looks at the root word, and drops the apostrophe-S. The
reason for this is that ready-made possessive forms can never be accurate. 
English is loaded with words such as gerunds, which serve both as nouns and
verbs (singing, stuffing etc). And there are plenty of words that function both
as a noun and as an adjective. So making a sometimes noun possessive doesn't
keep people from misusing it. Most nouns take an apostrophe-S, even if they end
in S: Charles's tonsils, Jones's leg. But this rule doesn't apply to many
ancient or historical words, so: Moses', Isis', Achilles'. The rule says I
should write "Kansas's wheat fields." But if I write "Kansas's streams," then
there is too much sibilance, so the astute editor will change it to "Kansas'
streams." So neither WordPerfect nor I have ever tried to codify the use of
possessives, since the knowledgeable writer knows it can't be done.
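
The root-word approach described above can be sketched as follows. This is a minimal illustration, assuming a plain set of known words; the helper names `strip_possessive` and `is_known` and the tiny word set are hypothetical, and this is not the code of WordPerfect, Hunspell, or the author's own checker.

```python
# Hypothetical helpers for illustration only; the word set is a stand-in
# for a real word list.

def strip_possessive(token: str) -> str:
    """Return the root of a possessive form, or the token unchanged."""
    # Apostrophe-S, with either a straight or a typographic apostrophe.
    for suffix in ("'s", "\u2019s"):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    # Bare trailing apostrophe, as in "Moses'" or "Kansas'".
    if token.endswith(("'", "\u2019")) and len(token) > 1:
        return token[:-1]
    return token


def is_known(token: str, words: set) -> bool:
    """Accept a token if it, or its root without the possessive, is listed."""
    return token in words or strip_possessive(token) in words


WORDS = {"Charles", "Jones", "Moses", "Kansas", "singing"}
```

With this approach both "Kansas's" and "Kansas'" pass, which leaves the choice between them to the writer instead of the word list.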

Apostrophes are used for living things, personifications, or words of space,
time, and weight. Also for common phrases like: heart's delight, stone's throw,
and water's edge. Note that "chair's leg" does not fit these criteria. However,
the phrase "he fell back into the chair's embrace" seems to pass because chairs
don't embrace, so this might be considered a personification. Most proper nouns
such as Titanic or London can be used as personifications, so the names of
cities, states, countries, rivers, and ships can easily take a possessive. Even
words like Chemistry can take a possessive form: Department of Chemistry's
examines. From this it is clear that many of the possessives that occur in the
current en_US.dic fail to conform to grammatical rules.

So a complete dictionary with possessives will probably take me a few weeks
more, and even then the possessives will be questionable, in much the same way
as the usage in Microsoft Word. 

David Dibble
dibble_d@sbcglobal.net
Comment 1 michael.ruess 2008-08-04 14:11:01 UTC
Reassigned to lingucomponent.
Comment 2 nemeth.lacko 2008-08-04 17:24:43 UTC
David,

Thanks in advance for your great contribution. I just started to make a new
version for morphological analysis and generation based on the old en_US
dictionary and WordNet data. There is an effort from Kevin Atkinson to make a
maintained version from the OpenOffice.org en_US dic, see the result in the
recent Mozilla Firefox (also here:
https://bugzilla.mozilla.org/show_bug.cgi?id=397150 and
http://wordlist.sourceforge.net). Unfortunately, it contains the same errors:

$ grep '\(.\)\1\1' en_US.dic
AAA
Andeee/M
Annnora/M
BBB
Diannne/M
Harwilll/M
KKK/M
Lilllie/M
Minnnie/M
Rafaellle/M
SSS
Sonnnie/M
WWW/M
iii
viii
...
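
For anyone without grep at hand, the same search can be sketched in Python. This is a rough equivalent, not part of the original comment; the function name `triple_letter_entries` and the sample lines are illustrative assumptions. Unlike the grep call, it looks only at the word part of each entry and ignores the affix flags after the slash.

```python
import re

# Same pattern as the grep call: any character followed by itself twice.
TRIPLE = re.compile(r"(.)\1\1")


def triple_letter_entries(lines):
    """Yield .dic entries whose word part has three identical characters in a row."""
    for line in lines:
        entry = line.strip()
        word = entry.split("/", 1)[0]  # drop affix flags such as /M
        if TRIPLE.search(word):
            yield entry


sample = ["AAA", "Rafaellle/M", "Lillie/M", "viii", "hello"]
found = list(triple_letter_entries(sample))
```

Legitimate entries such as acronyms and Roman numerals still match, so the result is a list of candidates to review by hand.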

I'd like to examine also the corpus based methods to improve the dictionary
data. I will use this issue for the discussion about the planned dictionary
improvements.

Best regards,
László
Comment 3 nemeth.lacko 2008-08-04 17:26:03 UTC
Target: 3.1
Comment 4 aardvark12 2008-09-08 19:53:04 UTC
Created attachment 56328 [details]
revised en_US.dic
Comment 5 aardvark12 2008-09-08 19:58:21 UTC
Here is the integrated US English dictionary for Open Office. You will find many
thousands of new words beyond the existing dictionary. All words were checked
against the American Heritage Dictionary or http://dictionary.reference.com. In
some cases, such as words that begin with the prefix "un," these sources failed
me, and I instead used http://www.merriam-webster.com for a full list of words
with the "un" prefix from an unabridged dictionary. I went through the word list
and added possessives manually.
   
This dictionary is released under the GNU GPL version 3: en_US.dic by David M.
Dibble, copyright September, 2008 (Standard terms apply--This is free software:
you can redistribute it and/or modify it under the terms of the GNU General
Public License as published by the Free Software Foundation, either version 3 of
the License, or (at your option) any later version.)

I compiled the dictionary using MUNCH under Puppy Linux. The regular en_US.dic
has a number of lines with numerals at the beginning (about 20 lines). Those can
be inserted into this dictionary. I wasn't quite sure what those lines meant.

Some quick explanations. Most dictionaries use common conventions. In a word
entry, "OR" means that words have equal weight, as in "burned or burnt" (though
the first listing may have a slight edge). In such cases both words are present
in this dictionary. Dictionaries use "ALSO" to indicate a second-rate or
inferior alternative, so in such cases the first listing should be used in a
spell checker to encourage people to use the best choice. For instance, "papoose
also pappoose." Microsoft Word uses "pappoose," but that word isn't even listed
in the American Heritage Dictionary, and in the Random House Unabridged
Dictionary the word "pappoose" is given as an "ALSO." So "papoose" is the best
choice. However, dictionaries can flat out disagree. Some list "facade"
[unaccented] as the best. Some list "façade" [c cedilla] as best. In that case
both words are in the spell checker. And words change as time passes. "Sea bird"
has always been two words, ("seawater" is one word), but I now think that
"seabird" is acceptable.

Then there are problems of capitalization. My word list has "leno, leno's,
slough, slough's." Jay Leno is a TV personality. Slough is a municipality in
England. So maybe the word list should be "leno, Leno's, slough, Slough's." I
just wasn't sure.

English is used internationally, as one sees on forums. So it seems odd to list
every tiny town in the United States, but ignore the major metropolitan centers
in the rest of the world. So I added many names for major cities, whether in
Japan, or Brazil, or Pakistan. All these names should be correctly accented. And
since I use Linux, I added names like Ubuntu, Xubuntu, Mandriva, AbiWord,
Gnumeric, and so forth.

Many place names have accents. But people often use common names like Yucatan or
Guantanamo or Galapagos without accents, and may not even be aware of the
accented form. So I decided to include both the unaccented words and the
accented words, though often the possessive form is only given for the accented
(correct) word.

I removed words that could cause problems. I previously commented on "Lindberg"
and "Lichtenstein." I also took out "corespondent," as many students will drop
the R when they mean "correspondent," with humorous results. Besides,
"corespondent" seems an outdated word; one very rarely hears it anymore. And I
took out "nob" since students are sure to spell "knob" without the K. But if
there is strong opinion that "corespondent" should be in the word list I would
not object to seeing it put back in. A few months ago I used the word "stelar"
in a review of a Tomb Raider custom level, but that word would just confuse
people who want "stellar," so "stelar" isn't in the word list, either. In other
words, a lot of judgment calls had to be made. Also I took out most of the
hardcore profanity and offensive racial epithets. People can still freely use
these words all they want; the words just aren't in the dictionary. 
Comment 6 nemeth.lacko 2008-12-07 04:58:54 UTC
David, please subscribe to wordlist-devel@lists.sourceforge.net:
 https://lists.sourceforge.net/lists/listinfo/wordlist-devel

We are working on the en_US dictionary for OpenOffice.org and Mozilla.
Unfortunately, Mozilla has a more strict license policy, and it needs
GPL/LGPL/MPL tri-license, GPL 3 is not enough, also for OpenOffice.org
pre-bundled dictionaries. You have made a lot of nice developments, that we can
integrate to the wordlist distribution or the generated Firefox/OpenOffice.org
dictionaries under your name and work together on a better and up-to-date
American English spelling dictionary. But you can also make your own dictionary
version for OpenOffice.org using the Extension support
(http://extensions.services.openoffice.org/).

> The regular en_US.dic
>has a number of lines with numerals at the beginning (about 20 lines). Those can
>be inserted into this dictionary. I wasn't quite sure what those lines meant.

It is for ordinal number checking (1st, *11st etc.)
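
The rule those lines enforce can be sketched in plain Python. This is only an illustration of the ordinal-suffix rule; it is not how the dictionary encodes it (the .dic file uses numeral entries with affix flags), and the names `expected_suffix` and `is_valid_ordinal` are hypothetical.

```python
import re

ORDINAL = re.compile(r"^(\d+)(st|nd|rd|th)$")


def expected_suffix(n: int) -> str:
    """Return the English ordinal suffix for a non-negative integer."""
    if 11 <= n % 100 <= 13:  # 11th, 12th, 13th, 111th, ...
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def is_valid_ordinal(token: str) -> bool:
    """True for forms like '1st' or '11th', False for '11st' or non-ordinals."""
    m = ORDINAL.match(token)
    return bool(m) and m.group(2) == expected_suffix(int(m.group(1)))
```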
Comment 7 aardvark12 2008-12-12 00:01:53 UTC
You requested that I subscribe to wordlist-devel@lists.sourceforge.net:
 https://lists.sourceforge.net/lists/listinfo/wordlist-devel

I have subscribed, and gotten a confirmation e-mail.

I do not see a problem with other licenses beyond GPL 3. The dictionary
represents hard years of work. My main concern was that I did not want a
corporation to take my word list, encrypt it, and pass it off as their own
spelling checker, sold for their profit. To that end I wanted to work in the
open source community, such as with AbiWord, OpenOffice.org, and Mozilla.

I used MUNCH under Puppy Linux to compile the word list, in dictionary format,
and submitted it as an attachment. If you would prefer to view the word list
before it was compiled, it can be sent in zip format. Then the word choices are
clearer.
Comment 8 aardvark12 2009-02-21 17:44:01 UTC
I began to read through my submitted spelling dictionary and noticed a couple of
omissions. The words "antiquark" and "antilepton" both lack a plural entry. This
is easily solved by adding /S to their entry in the word list. I am prepared to
spend 7-10 days going through the word list, looking for such omissions, but
wasn't sure what the status of the word list is. Is this something you intend to
use, and if so, how much proofreading and checking is being done by others? If
most of it isn't being used, then there is no rush for me to do anything.
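
To show what adding /S to an entry does, here is a small sketch that expands one .dic entry into the forms it covers. The rule table is an assumption modeled loosely on a typical English plural suffix class; the real rules live in the accompanying .aff file and may differ. The name `expand_entry` is hypothetical.

```python
import re

# Assumed plural rules: (condition on the word, characters to strip, text to add).
# These are illustrative and are not read from any real .aff file.
PLURAL_RULES = [
    (re.compile(r"[^aeiou]y$"), 1, "ies"),  # party -> parties
    (re.compile(r"[aeiou]y$"), 0, "s"),     # day -> days
    (re.compile(r"[sxzh]$"), 0, "es"),      # box -> boxes
    (re.compile(r"[^sxzhy]$"), 0, "s"),     # antiquark -> antiquarks
]


def expand_entry(entry: str) -> list:
    """Return the word forms covered by an entry such as 'antiquark/S'."""
    word, _, flags = entry.partition("/")
    forms = [word]
    if "S" in flags:
        for pattern, strip, add in PLURAL_RULES:
            if pattern.search(word):
                forms.append(word[: len(word) - strip] + add)
                break  # only the first matching rule applies
    return forms
```

Under these assumed rules, "antiquark" alone covers one form, while "antiquark/S" also covers "antiquarks".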

Second, I wanted to mention that the spelling checker can be enhanced by the
Auto Correction feature in Open Office writer.

It is very important for a published writer not to make mistakes. I read the
"Wasteland" series by Stephen King, and in the third book he uses "for awhile"
five times on facing pages. This is the sort of thing that makes one sit up. I
can remember nothing else that was on those two pages, but years later still
remember those five errors. "Awhile" is an adverb, so it cannot be the object of
a preposition. Also "awhile" means "for a time" so saying "for awhile" is
equivalent to saying "for for a time." The correct usage is the noun form, which
is two words "a while," hence "for a while."

Such mistakes are easily caught using Open Office Auto Correction, entry and
replacement:

"for awhile"  "for a while"
"after awhile" "after a while"
"pointblank"  "point-blank"
"antisemitism" "anti-Semitism"

This helps, since otherwise students may think that the omission of "pointblank"
and "antisemitism" is a mistake. Note that Microsoft uses these two wrong
entries, and WordNet usually goes along with Microsoft. This is typical
Microsoft disregard of language, and professional writers do not endorse this.
There are many other hyphenated words that can be included in the Auto
Correction feature. I was often frustrated by not being able to include
hyphenated words in the word list (though there are entries for "AK-47" and for
"al-Qaeda").

Also I noticed that Hunspell does not catch very short accented words, such as
"eclair" or "elan," which have acute accents over the E. The correctly accented
word is in the dictionary, but Hunspell does not give it as a spelling
suggestion. So use Auto Correction to make sure that such short accented words
will be handled correctly. And have an entry for "deja vu" with all its accents.

As it is, the Auto Correction feature is wasted. It functions exactly as in
Microsoft Word, catching a few misspelled words, and this is better left to the
spelling dictionary. Instead, the Auto Correction feature could become quite useful.
Comment 9 aardvark12 2009-04-14 16:22:22 UTC
UPGRADE  DICTIONARY.
I have checked articles in the New York Times and the Wall Street Journal, and
so forth, and am adding a number of new words, such as: Facebook, MySpace,
Wikipedia, Geithner, cyberspy, etc. Am also adding a number of possessives to
the dictionary, as there seems some confusion among writers about adjectives and
nouns. Also spell checked some computer books. This will modernize the
dictionary with thousands of additional words.

I expect to release the updated version in two weeks. It will be the same size as
the original Open Office en_US.dic, though since I use real words, there will be
a 50,000 word difference between my version and the dictionary packaged with
Open Office.
Comment 10 aardvark12 2009-04-27 19:36:18 UTC
Created attachment 61855 [details]
updated, enhanced en_US.dic
Comment 11 aardvark12 2009-05-03 13:46:49 UTC
I have uploaded the enhanced dictionary, dd_2009_04_en_US.dic.  It works in Open
Office as en_US.dic. There is now a 63,000 word difference between this and the
original Open Office en_US.dic.
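
A difference figure of this kind can be approximated by comparing the two files as sets. The sketch below is an assumption about method, not a record of how the number above was obtained: it compares base entries only, so a count of fully expanded word forms would first need the lists unmunched. The name `load_words` and the sample entries are illustrative.

```python
def load_words(lines):
    """Return the set of base words from .dic-style lines."""
    words = set()
    for i, line in enumerate(lines):
        entry = line.strip()
        if not entry:
            continue
        if i == 0 and entry.isdigit():  # first line of a .dic file is the entry count
            continue
        words.add(entry.split("/", 1)[0])  # drop affix flags
    return words


old = load_words(["4", "airbag/S", "Rafaellle/M", "tea/S", "Lindbergh/M"])
new = load_words(["4", "tea/S", "Lindbergh/M", "Facebook/M", "point-blank"])

only_old = old - new        # entries dropped in the new list
only_new = new - old        # entries added in the new list
difference = len(old ^ new)  # size of the symmetric difference
```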

I received an e-mail from the original dictionary maintainer saying that he will
block any effort to replace HIS dictionary. 

It is regrettable Open Office won't allow improvements. However, I have filed
issue #101500 in order to handle hyphenated words (Hunspell does not work with
hyphenated words).

If you ever allow people to work on the dictionary let me know.
Comment 12 thorsten.ziehm 2009-05-18 14:31:46 UTC
OOo 3.1 is released. Please check the issue, if it still exists in OOo 3.1. If
yes, please work on it to get it fixed in one of the next releases. Until then
the issue gets the target 3.x.
Comment 13 aardvark12 2011-02-11 17:14:36 UTC
Created attachment 75820 [details]
February 11, 2011 update of en_US.dic;  146,540 words
Comment 14 aardvark12 2011-02-11 17:18:53 UTC
The word list was pruned of specialized or obscure words, particularly if those
might interfere with finding more common words. As an example, 'chough' and
'scoter' are birds, but most people will be interested in typing 'cough' or
'scooter.' Sometimes choices aren't clear. 'Whicker' is a horse's whinny, but
perhaps there is a conflict with 'wicker.' 'Whicker' was removed. Often a
dictionary will list plurals for words ending in 'o' as either -os or -oes, or
words ending in 'a' as -as or -ae. If a dictionary separates the choices with
'or' then both plurals have equal weight, but a spellchecker may help a writer's
consistency by only listing the first choice.

It has not escaped my attention that removing words helps to make room for later
additions, as a number of new words and proper nouns need to be added to keep
the word list current. (May 6, 2009 version, 150,240 words. Current version
146,540 words.) The words 'shalt' and 'spake' are now in the list, but have been
marked with an exclamation point for NO SUGGEST.

Hunspell is good at dividing long words into two, and checking each portion,
useful for a Hungarian spellchecker. It is unable to handle hyphenated words.
For this spellchecker to function properly, users need to install an autocorrect
word list in Open Office, so that when 'paperclipped' is typed, 'paper-clipped'
is automatically substituted. This is also true for some accented words, so that
typing 'elan' produces 'élan.' Unfortunately, Open Office doesn't use the
autocorrect feature this way.
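
The kind of substitution meant here can be sketched as a whole-word replacement table. This is a minimal illustration, not Open Office's autocorrect implementation; the table holds the two examples from this comment plus two from an earlier comment, and the name `autocorrect` is hypothetical. Matching is case-sensitive in this sketch.

```python
import re

# Entry -> replacement, in the style of an autocorrect list.
AUTOCORRECT = {
    "paperclipped": "paper-clipped",
    "elan": "élan",
    "for awhile": "for a while",
    "pointblank": "point-blank",
}

# Longest entries first so multiword phrases win over shorter overlaps.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(AUTOCORRECT, key=len, reverse=True))
    + r")\b"
)


def autocorrect(text: str) -> str:
    """Replace whole-word matches of the entries with their replacements."""
    return _PATTERN.sub(lambda m: AUTOCORRECT[m.group(1)], text)
```

The word-boundary anchors keep "elan" from being replaced inside a longer word such as "Melanie".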

Sources, listed in order of preference.

1) http://www.thefreedictionary.com/
American Heritage Dictionary, and Collins English Dictionary

2) http://dictionary.reference.com/
Random House Dictionary, Collins English Dictionary, Webster's Unabridged Dictionary

3) http://www.merriam-webster.com/
Merriam-Webster Dictionary

4) http://oxforddictionaries.com/?attempted=true
The Oxford English Dictionary

Dictionaries often disagree on compound words, or on spelling. Generally the
Random House Dictionary is very good, but it gives a spelling of 'mujahedin.'
Going to an Arabic source to clarify matters only adds to the confusion, as that
site gives seven possible spellings. Other dictionaries use the word
'mujahideen,' so that seems preferable. Because of past problems with WordNet, I
don't accept words with only this single source. WordNet gathers words from the
web. This says nothing about the way people write, only that people are blindly
reproducing the questionable Microsoft spellchecker, which has total dominance
in the U.S.

February 11, 2011
Comment 15 aardvark12 2011-02-16 17:22:20 UTC
Created attachment 75850 [details]
autocorrect hyphenates, compound words, grammar errors; word list
Comment 16 aardvark12 2011-02-16 17:34:02 UTC
PROBLEM WORDS, submit autocorrect suggestions in plain text

A number of problem words were created by Microsoft's spelling mistakes. Because
of Microsoft's total domination in the United States, these errors have been
compounded a billion times in the past dozen years. Some words are entering the
language. The surprising thing is that more words haven't been subverted, but
that most intelligent people still continue to write "point-blank" instead of
the ugly Microsoft "pointblank." Microsoft handled things by just removing
essential hyphens. After all, no sense paying for programmers, or for anyone who
knew even a smattering of the English language. These problem words have to be
checked every year. The Oxford English Dictionary (OED) monitors ten thousand
transitional words on a daily basis.

The OED now accepts the following nouns as one word, and they may be included in
en_US.dic (they are missing from my version):

airbag
airbase
lifebuoy
waterhole - one word OED, all other dictionaries, two words

Note the following key words:

"all right" - the only correct usage. "alright" not acceptable
"point-blank" (Microsoft's word, "pointblank," not acceptable)

The preferred choice for "cafe" is now without an accented e.

The OED has "razor blade." Everyone else uses "razorblade."

The American Heritage Dictionary has "mockup," "mahjong," and "housepainter."
OED and others use "mock-up." Almost everyone uses "mah-jongg" and "house
painter." (May want to put "mah-jongg" in the autocorrect list.)

The OED has "hot plate" and "hot pot," but Collins English Dictionary has these
as one word.

Here are some words I really hate to look up every year. Real dictionaries list
them as two words. This isn't a complete list, but these words are noted in the
attached autocorrect suggestions file.

bean sprouts
black light
coal mine
con man, con men
drift net
drop kick
fire truck
floor show
fly swatter [one dic. has flyswatter as second-rate choice]
gun battle
hair dryer
ice pack
land mine
love child
milk shake
nose cone
school day
six-shooter [hyphenation is correct, as here]
staff room [British usage, a teachers' lounge, one word]
tea bag
tea leaves
trash can
water mill

Hunspell does not handle hyphenated words, but these can be substituted using
the autocorrect feature of Open Office. Also some accented words aren't found by
Hunspell, and they are in the list, as well as recommendations for various word
substitutions. The fifteen-page plain text file is attached. 
Comment 17 aardvark12 2011-03-24 01:11:28 UTC
Created attachment 76178 [details]
March 23, 2011: use hyphenated words; detect accents; many new words 

The revisions and updates are done to my Open Office U.S. English spellchecker. There is a 55,000-word difference to the official Open Office version of December 27, 2010.

WHAT WAS DONE

Hyphenated words may now be checked as a unit. This makes Hunspell a professional spellchecker. Common words like "has-beens," "have-nots," or "lean-tos" no longer display an error, as such words are in en_US.dic. Type "Toulouse-letreck" and the speller will recommend "Toulouse-Lautrec." To use the new hyphenated-words feature you need Open Office 3.2 or later (doesn't work on 3.1). Tested on 3.2.1.

Changed the affix file. Now Hunspell will recommend accented words such as "señor, cliché, touché, piñata, or garçon." These were in my word list, but Hunspell was unable to suggest them as replacements. Now Hunspell recommends "all right" for "alright." Further modified affix file for better dictionary compression, new prefix (PFX O) and suffix (SFX Q and W) entries. (First line of SFX entry contains N for NO; the man/woman SFX entries in en_AU.aff and en_NZ.aff are incorrect.)
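
The effect on accented words can be illustrated outside Hunspell with a simple accent-folding lookup. This is a different technique from the affix-file change described above, shown only to make the behavior concrete: each accented word is indexed under its unaccented spelling, so typing the plain form finds the accented one. The names `fold_accents`, `build_accent_index`, and `suggest_accented` are hypothetical.

```python
import unicodedata


def fold_accents(word: str) -> str:
    """Strip combining marks, so that 'élan' becomes 'elan'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accent_index(words):
    """Map each unaccented spelling to the accented words it could stand for."""
    index = {}
    for w in words:
        folded = fold_accents(w)
        if folded != w:  # index only words that actually carry accents
            index.setdefault(folded, []).append(w)
    return index


WORDS = ["señor", "cliché", "touché", "piñata", "garçon", "élan", "plain"]
INDEX = build_accent_index(WORDS)


def suggest_accented(typed: str):
    """Return accented dictionary words matching an unaccented typed word."""
    return INDEX.get(typed, [])
```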

Made major revisions to word list. Added hundreds of new words like: cyberbullying, cyberbully, cybersecurity, bridezilla, overleveraged (from subprime mortgage crash), BFF, microblog, microblogging, steampunk. New words are approved by Oxford English Dictionary. Added a large number of hyphenated words.

David M. Dibble
Comment 18 aardvark12 2011-04-19 23:04:00 UTC
Created attachment 76414 [details]
April 19, 2011: affix file for better compression; expand hyphenated entries

Changed affix file to give much better compression. On a reasonable sized word list, the dictionary file will be 70-80k smaller than before. Due to new affix file, current en_US.dic is only 568k.

In order to improve hyphenation entries, went through entire word list. While doing this, removed variant words and plurals for clarity. Many hyphenated words can be formed from individual words, and need not be included in the spellchecker. The most helpful entries should now be in en_US.dic.
Comment 19 Pedro Giffuni 2011-09-04 19:42:01 UTC
Hello;

Please note that we will not receive contributions under copyleft licenses
anymore. When you can, we recommend Apache License 2.0 to be consistent with the rest of the suite.
http://www.apache.org/licenses/LICENSE-2.0

This license is considered GPL3 compatible by the FSF and will permit wider distribution of your work.
Comment 20 Pedro Giffuni 2011-12-01 15:51:10 UTC
Dictionaries were removed as part of Apache IP Clearance process.