Issue 19563 - stemming and morphological generation in the thesaurus
Summary: stemming and morphological generation in the thesaurus
Status: CLOSED FIXED
Alias: None
Product: General
Classification: Code
Component: thesaurus
Version: 3.3.0 or older (OOo)
Hardware: All All
Importance: P3 Trivial with 1 vote
Target Milestone: OOo 3.1
Assignee: stefan.baltzer
QA Contact: issues@lingucomponent
URL:
Keywords:
Duplicates: 51889 76272
Depends on:
Blocks:
 
Reported: 2003-09-14 16:39 UTC by lars
Modified: 2013-02-24 20:40 UTC
CC List: 7 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
en_US.dic patch (790 bytes, patch)
2008-06-03 19:51 UTC, nemeth.lacko
en_US.aff patch, see Hunspell manual for the morphological notation (http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754) (521 bytes, patch)
2008-06-03 19:53 UTC, nemeth.lacko
patched en_US dictionary files (no need to apply the previous patches) (244.47 KB, application/octet-stream)
2008-07-02 12:16 UTC, nemeth.lacko
Wordlist Hunspell en_US, en_CA spelling and morphological dictionaries (446.13 KB, application/x-compressed)
2008-12-12 01:41 UTC, nemeth.lacko
improved suggestions for "astronauts": spacemen, cosmonauts, travelers (screenshot, note: mostly British "traveller" and its plural form aren't there in the en_US spelling and morphological dictionary) (22.09 KB, image/png)
2008-12-12 02:10 UTC, nemeth.lacko
Dictionaries, release 2 (fixed morphological codes of comparative affixes) (446.16 KB, application/x-compressed)
2008-12-12 09:36 UTC, nemeth.lacko
English spelling and morphological dictionary conversion script (5.63 KB, application/x-compressed)
2008-12-18 16:40 UTC, nemeth.lacko
Test extension (en_US dictionaries, but for the hu_HU locale) (259.06 KB, application/x-compressed)
2009-01-23 15:47 UTC, nemeth.lacko

Description lars 2003-09-14 16:39:13 UTC
thesaurus should also search synonyms omitting "ing" when such a word is looked 
up

it could separate the results into an infinitive list and a found-as-entered list


oh: and increase the size of the thesaurus dialogue, so that one sees more 
listed entries at once!
Comment 1 thomas.lange 2003-09-15 08:03:25 UTC
Since the thesaurus is used for several languages it is not a good
idea to add lang specific code like removing "ing" or for example
adding plurals to the code itself. There is no end to be seen when
starting this.
I think instead the words should be available in the word list of the
thesaurus itself.

TL->Kevin: Can you please take over?

TL->OH: Please submit a separate bug for the dialog size.
Comment 2 khendricks 2003-09-16 13:51:05 UTC
Hi, 
 
Yes, the thesaurus needs a lot of work to group synonyms by meaning (which will 
"fix" the dialog problem) and to greatly expand the wordlist to handle more words. 
 
All of this is in the works but requires a lot of volunteer help and time. 
 
Changing this to started ... 
 
Kevin 
 
Comment 3 khendricks 2003-09-16 13:51:42 UTC
. 
 
Comment 4 lars 2003-09-16 15:39:39 UTC
see also issue 19584 (thesaurus dialogue size improvement), issue 
19586 (present thesaurus results more organized) and issue 19647 (on-
the-fly thesaurus)

> requires a lot of volunteer help and time.


yes; perhaps there already are such lists (databases) on the internet 
somewhere, more specifically some groups are working on it already 
perhaps. One can build up on their work or work together with them or 
if none such groups exist work together with other projects which 
benefit from such work, ie. Mozilla spellcheck or thesaurus group 
(does a moz thesaurus group exist? hmm, can't find it on their 
projects page) (actually every (non-)human (:-)) benefits from this 
work ((don't hit me:) comparing vocabulary with Lego and Duplo (Lego 
Creator): more vocabulary as more quantity of and more varied pieces 
of Lego (bricks/blocks) allowing to build more than with big Duplo 
bricks -- and if others "understand" these pieces one can play 
together (uhui......)).
So there can be started a new ((one-and-only) reference) vocabulary 
database for all languages :-)

A "group" which can help on this field is http://dict.leo.org (for 
english and german)  - Sun sponsored by the way.
Comment 5 ooolist2007 2005-10-16 12:21:30 UTC
FYI: It will be possible to solve this, not only for -ing, with hunstem (part 
of hunspell). 
 
Comment 6 ooolist2007 2005-10-16 13:53:11 UTC
*** Issue 51889 has been marked as a duplicate of this issue. ***
Comment 7 nemeth.lacko 2005-11-23 10:52:30 UTC
Target of Hunspell and Thesaurus integration: 2.0.2 (with morphological
generations, for example: making -> doing).
Need also a better American English dictionary based on real affixes.
(It seems, British is good.)
Comment 8 nemeth.lacko 2006-02-06 14:12:24 UTC
Target: 2.0.3
Comment 9 pavel 2006-05-17 07:47:40 UTC
nemeth?
Comment 10 nemeth.lacko 2006-05-17 08:40:17 UTC
nemeth->pjanik: I hope, 2.0.4.
Comment 11 milek_pl 2007-04-13 16:51:41 UTC
*** Issue 76272 has been marked as a duplicate of this issue. ***
Comment 12 Mathias_Bauer 2008-01-07 10:59:05 UTC
Any news? I suggest to move the target to 3.x as otherwise we need a fix ASAP.
Comment 13 nemeth.lacko 2008-01-07 13:05:59 UTC
New target: 3.0

Very good news: the latest Hunspell release has language-independent
stemming and morphological generation functions for this task, tested with the
new Hungarian spelling and morphological dictionary. Over the next months I will work on
a Hungarian thesaurus project, and I plan to fix this issue, too. Any help would
be welcome, especially for the Hunspell-OOo thesaurus integration.
Comment 14 miles 2008-05-17 20:28:10 UTC
This would be really great; for Slovenian, a thesaurus without this capability is
useless. I hope you make it for 3.0.

Thanks and good luck! Will gladly test it with Slovenian files if it works as it
should.
Comment 15 nemeth.lacko 2008-05-18 13:23:49 UTC
Filmsi: Thanks for your kind words. Next week I will release a test version,
mail it to the lingu-dev mailing list, and also create a CWS. Most of the stemming will
work without modification of the spelling dictionary, if the affix file contains
real affixes.
Comment 16 nemeth.lacko 2008-05-27 12:31:52 UTC
Fixed in the CWS hunspell4thesaurus.

Test data for stemming:

Press Ctrl-F7 (thesaurus) on "facts" in Writer. The MyThes thesaurus will stem
"facts" using the UNO interface of the spellchecker component, and show the synonyms
of "fact".
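
The lookup order described here (stem the word, then query the thesaurus with the stem) can be sketched roughly as follows. This is only an illustration in Python: the STEMS table and THESAURUS dict are made-up stand-ins, and the real code goes through the spellchecker's UNO interface and MyThes, whose APIs are not reproduced here.

```python
# Hypothetical stand-ins: a tiny stem table and thesaurus, not Hunspell or MyThes data.
STEMS = {"facts": ["fact"], "making": ["make"]}
THESAURUS = {"fact": ["reality", "truth"], "make": ["do", "create"]}


def lookup(word):
    """Return (entry, synonyms) pairs for the word as entered, then for its stems."""
    results = []
    for candidate in [word] + STEMS.get(word, []):
        if candidate in THESAURUS:
            results.append((candidate, THESAURUS[candidate]))
    return results


print(lookup("facts"))
```

Trying the word as entered first keeps existing thesaurus entries reachable, and the stems only add further hits.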

Comment 17 nemeth.lacko 2008-05-27 14:43:04 UTC
Issue Type: DEFECT. (Maybe morphological generation is an enhancement, but
stemming is a bug fix and basic competitive feature of the thesaurus.)


Comment 18 nemeth.lacko 2008-06-03 19:50:06 UTC
Test build:

http://hunspell.sourceforge.net/OOo_3.0.0_080603_LinuxIntel_install.tar.gz

en_US dictionary patch for affixation test (attached also in unified diff format):

-----------en_US.dic.diff------------------
5561c5561
< bet/MS
---
> bet/MS        ts:nom
8945c8945
< cat/SMRZ
---
> cat/SMRZ      ts:nom
30871c30871
< kitty/SM
---
> kitty/SM      ts:nom
33932c33932
< mammal/SM
---
> mammal/SM     ts:nom
43289c43289
< pool/MDSG
---
> pool/MDSG     ts:nom
44947c44947
< pussy/TRSM
---
> pussy/TRSM    ts:nom
------------en_US.aff.diff----------------------
92,95c92,95
< SFX S   y     ies        [^aeiou]y
< SFX S   0     s          [aeiou]y
< SFX S   0     es         [sxzh]
< SFX S   0     s          [^sxzhy]
---
> SFX S   y     ies        [^aeiou]y is:pl
> SFX S   0     s          [aeiou]y is:pl
> SFX S   0     es         [sxzh] is:pl
> SFX S   0     s          [^sxzhy] is:pl
------------------------------------------------

Test with the patched OOo and en_US.dic:

1. See "kitties"+Ctrl-F7 in Writer. Thesaurus dialog shows "pools" and "bets"
synonyms instead of "pool" and "bet" in the first meaning "pool".


2. Choose "kitty-cat" meaning in the dialog. It has "pussies", "domestic cats"
and "house cats" synonyms instead of "pussy", "domestic cat" and "house cat".


3. Choose the "domestic cats" synonym with a double click; the displayed "house cat"
meaning has "house cats", "cats" and "true cats" synonyms instead of "house
cat", "cat" and "true cat".
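
To make the generation step in these tests more concrete, here is a small Python sketch that applies the four "SFX S" rules from the en_US.aff patch above to a stem. It assumes the usual reading of a suffix rule as (characters to strip, characters to add, condition on the end of the stem). It is a simplified illustration, not Hunspell's implementation, and the function name pluralize is invented for the example.

```python
import re

# The four "SFX S" rules from the en_US.aff patch: (strip, add, condition).
# "0" means "strip nothing"; the condition must match the end of the stem.
PLURAL_RULES = [
    ("y", "ies", "[^aeiou]y"),
    ("0", "s", "[aeiou]y"),
    ("0", "es", "[sxzh]"),
    ("0", "s", "[^sxzhy]"),
]


def pluralize(stem):
    """Apply the first SFX S rule whose condition matches the end of the stem."""
    for strip, add, condition in PLURAL_RULES:
        if re.search(condition + "$", stem):
            base = stem if strip == "0" else stem[: -len(strip)]
            return base + add
    return None  # no rule applies


for word in ["kitty", "pool", "bet", "pussy", "cat"]:
    print(word, "->", pluralize(word))
```

In the test scenario the thesaurus stores singular synonyms; the plural forms shown in the dialog come from re-applying the "is:pl" suffix of the looked-up word to each synonym that carries the same "ts:nom" category.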

Comment 19 nemeth.lacko 2008-06-03 19:51:58 UTC
Created attachment 54217 [details]
en_US.dic patch
Comment 20 nemeth.lacko 2008-06-03 19:53:55 UTC
Created attachment 54218 [details]
en_US.aff patch, see Hunspell manual for the morphological notation (http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754)
Comment 21 nemeth.lacko 2008-06-05 12:02:52 UTC
Reassigned for QA
Comment 22 nemeth.lacko 2008-06-20 16:47:22 UTC
Old Linux test build for dictionary developers:
http://hunspell.sourceforge.net/OOo_3.0.0_080603_LinuxIntel_install.tar.gz

Comment 23 nemeth.lacko 2008-07-02 11:57:13 UTC
Linux test build (generated on an Ubuntu 8.04):

http://hunspell.sourceforge.net/OOo_3.0.0_080702_LinuxIntel_install.tar.gz

Note: After installation on Ubuntu 8.04, I ran it with the command
LD_LIBRARY_PATH=/usr/lib /opt/openoffice.org3/program/soffice
because of a symbol lookup error (/usr/lib/libcairo.so.2: undefined symbol:
FT_Library_SetLcdFilter)
Comment 24 nemeth.lacko 2008-07-02 12:16:42 UTC
Created attachment 54888 [details]
patched en_US dictionary files (no need to apply the previous patches)
Comment 25 nemeth.lacko 2008-07-15 15:45:26 UTC
The new Windows test build contains Hunspell 1.2.6 with affix condition matching
fixes: hunspell.sourceforge.net/Windows080715/en-US.zip

en_GB test word for the affix condition fix: "entertained" (it is accepted by
the new build).
Comment 26 thorsten.ziehm 2008-08-15 10:10:09 UTC
The CWS is on target 3.0.1, therefore I changed the issue to the corresponding
target.
Comment 27 nemeth.lacko 2008-12-10 11:54:01 UTC
New test build:  ftp://ftp.fsf.hu/OpenOffice.org_hu/devel/

Changes: Affixation of multi-word expressions is now forbidden. 

Steps of the verification:

0. Install attached en_US dictionary.

1. See "kitties"+Ctrl-F7 in Writer. Thesaurus dialog shows "pools" and "bets"
synonyms instead of "pool" and "bet" in the first meaning "pool".


2. Choose "kitty-cat" meaning in the dialog. It has "pussies" instead of "pussy".


3. Choose "pussies" synonym with double click, and the "kitty" meaning has a
"kitties" suggestion instead of "kitty".
Comment 28 rene 2008-12-11 10:24:02 UTC
cws at target 3.1 -> issue at target 3.1
Comment 29 nemeth.lacko 2008-12-12 01:41:41 UTC
Created attachment 58734 [details]
Wordlist Hunspell en_US, en_CA spelling and morphological dictionaries
Comment 30 nemeth.lacko 2008-12-12 02:10:41 UTC
Created attachment 58739 [details]
improved suggestions for "astronauts": spacemen, cosmonauts, travelers (screenshot, note: mostly British "traveller" and its plural form aren't there in the en_US spelling and morphological dictionary)
Comment 31 nemeth.lacko 2008-12-12 02:48:25 UTC
Attached en_US, en_CA spelling and morphological dictionaries

They are extended equivalents of the last Wordlist Hunspell dictionaries
(version 2008-12-05).

Tested with Hunspell 1.1.12 (OOo 3.0), too. Only ordinal number checking doesn't
work with Hunspell 1.1.12. (COMPOUNDRULE didn't handle numerical flags in older
Hunspell versions.)

Attached screenshot: morphological dictionary test in the test build
(ftp://ftp.fsf.hu/OpenOffice.org_hu/devel/)

(tests for dictionary equivalence:
$ unmunch <(sed -n '24,$p' en_US.dic) en_US.aff | sort | uniq >/tmp/en_US.wordlist
$ cat <(echo badword) /tmp/en_US.wordlist | hunspell -d hunspell-en-morph-20081212/en_US -l badword
badword
$ unmunch <(sed -n '24,$p' en_CA.dic) en_CA.aff | sort | uniq >/tmp/en_CA.wordlist
$ cat <(echo badword) /tmp/en_CA.wordlist | hunspell -d hunspell-en-morph-20081212/en_CA -l 
badword

Reverse:

$ awk 'FILENAME~/en_CA_notaliascomp[.]aff$/{if (NF==4){n[$2]=$4; i=0; next};
i++;s[$2,i,1]=($3=="0"?0:length($3));s[$2,i,2]=($4=="0"?"":$4);s[$2,i,3]=$5"$";
next}
!/\//{print $1;next}
{split($1,a,"/");print a[1];l=split(a[2],b,",");
for(i=1;i<=l;i++){ m=n[b[i]];
for(j=1;j<=m;j++){if(a[1] ~ s[b[i], j, 3])
print substr(a[1], 1, length(a[1])-s[b[i], j, 1]) s[b[i], j, 2]}}}' \
en_CA_notaliascomp.{aff,dic} | sort | uniq | sed -n '3,$p' >/tmp/en_morf.wl0
$ diff /tmp/en_CA.wordlist /tmp/en_morf.wl0 
0a1,22
> 1
> 1st
> 1th
> 2
> 2nd
> 2th
> 3
> 3rd
> 3th
> 4
> 4th
> 5
> 53000
> 5th
> 6
> 6th
> 7
> 7th
> 8
> 8th
> 9
> 9th

(only extra words)
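
The final comparison above boils down to checking that the new word list contains everything in the old one and reporting what is extra. A minimal Python sketch of that check follows; the sample lists are invented, and in the commands above the real lists come from unmunch and the awk script.

```python
def compare_wordlists(old_words, new_words):
    """Return (missing, extra): words lost from, and words added to, the new list."""
    old_set, new_set = set(old_words), set(new_words)
    missing = sorted(old_set - new_set)
    extra = sorted(new_set - old_set)
    return missing, extra


# Made-up sample data standing in for /tmp/en_CA.wordlist and /tmp/en_morf.wl0.
old_wordlist = ["cat", "cats", "kitty"]
new_wordlist = ["1st", "2nd", "cat", "cats", "kitty"]

missing, extra = compare_wordlists(old_wordlist, new_wordlist)
print("missing:", missing)
print("extra:", extra)
```

An empty "missing" list together with a non-empty "extra" list corresponds to the "only extra words" outcome reported here.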
Comment 32 miles 2008-12-12 08:17:24 UTC
Could someone working on this please create a document or a wiki page
explaining what this is all about? I am working with the Slovenian OOo localization
team as lead translator and am also working on the Slovenian thesaurus at
www.tezaver.si.
I do not know what needs to be done for other languages, and I do not know if the
Slovenian spelling and hyphenation dictionaries used in OOo have all the
attributes necessary for what you are trying to do. How do I check whether the
Slovenian dictionary has the right form? If it does not, what do I need to do with
the dictionary, and what form should it use? Etc., etc.
So please do explain this to other localization teams, so that not only English and two
or three other languages benefit, but all localization teams can
work on their languages concurrently, making OOo better for everyone.
Thanks.
Thanks.
Comment 33 nemeth.lacko 2008-12-12 09:36:09 UTC
Created attachment 58765 [details]
Dictionaries, release 2 (fixed morphological codes of comparative affixes)
Comment 34 nemeth.lacko 2008-12-12 17:34:02 UTC
filmsi: most of the stemming issues will work with the recent dictionaries.
For irregular dictionary items (affixed words), you can use the "st:" field to
add the stem (use a tab instead of a space for backward compatibility) to the
dictionary item:

best st:good

For morphological generation, you need to specify the morphological categories
of the affixes and dictionary items by "ds:",  "is:", "ts:" fields, or
allomorphs by the "al:" items, like in the attached patches. An example for the
"al" items:

best st:good is:comp2
better st:good is:comp1
good al:better al:best ts:0 
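
As a rough illustration of how such entries can be read, the Python sketch below parses the three example lines into a word plus its morphological fields and resolves the stem through the "st:" field. The helper names parse_entry and stem_of are invented for the example, and the sketch ignores affix flags (the part after "/" in real dictionary lines); it is not how Hunspell itself stores the data.

```python
def parse_entry(line):
    """Split a dictionary line into (word, {field: [values]})."""
    word, *fields = line.split()  # split() handles both tabs and spaces
    parsed = {}
    for field in fields:
        key, _, value = field.partition(":")
        parsed.setdefault(key, []).append(value)
    return word, parsed


# The example lines from this comment.
LINES = [
    "best st:good is:comp2",
    "better st:good is:comp1",
    "good al:better al:best ts:0",
]
ENTRIES = dict(parse_entry(line) for line in LINES)


def stem_of(word):
    """Return the explicit st: stem if the entry has one, else the word itself."""
    return ENTRIES.get(word, {}).get("st", [word])[0]


print(stem_of("best"))
print(ENTRIES["good"]["al"])
```

Fields that can repeat, such as "al:", are kept as lists so that both allomorphs of "good" survive parsing.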

Wiki is a good idea for more explanation. I will use it. Thanks, László
Comment 35 nemeth.lacko 2008-12-18 16:05:34 UTC
The newest versions of the spelling and morphological dictionaries were attached
to the Issue 97403.
Comment 36 nemeth.lacko 2008-12-18 16:40:43 UTC
Created attachment 58925 [details]
English spelling and morphological dictionary conversion script
Comment 37 nemeth.lacko 2009-01-23 15:47:20 UTC
Created attachment 59630 [details]
Test extension (en_US dictionaries, but for the hu_HU locale)
Comment 38 nemeth.lacko 2009-01-23 15:56:49 UTC
I have attached a test dictionary. It contains en_US dictionaries, but is installed
for the hu_HU locale to avoid a collision with the default dictionary (it is not
possible to switch off a default dictionary extension in the extension manager). The
extension contains a full en_US spelling dictionary and a minimal thesaurus for the
verification.

Steps of the verification:

1. Install attached extension.

2. Change the document language to Hungarian.

3. See "kitties"+Ctrl-F7 in Writer. Thesaurus dialog shows "pools" and "bets"
synonyms instead of "pool" and "bet" in the first meaning "pool".


4. Choose "kitty-cat" meaning in the dialog. It has "pussies" instead of "pussy".


5. Choose "pussies" synonym with double click, and the "kitty" meaning has a
"kitties" suggestion instead of "kitty".

Comment 39 stefan.baltzer 2009-01-30 13:50:06 UTC
Verified in CWS hunspell4thesaurus.
Comment 40 miles 2009-02-03 16:40:35 UTC
Sorry, should this already work in m40 (i.e. 3.1)? I downloaded a Pavel Janik
Slovenian build of m40, but couldn't make it work with Slovenian thesaurus.
Comment 41 stefan.baltzer 2009-02-04 08:19:33 UTC
sba - > filmsi: CWS hunspell4thesaurus is not yet nominated/integrated. To track
the progress:
http://eis.services.openoffice.org/EIS2/cws.ShowCWS?Path=DEV300%2Fhunspell4thesaurus
Comment 42 thorsten.ziehm 2009-07-20 14:52:03 UTC
This issue was closed automatically and was not rechecked in a current version of
OOo. The fix should have been integrated in OOo for more than half a year. If
you think this issue isn't fixed in a current version (OOo 3.1), please reopen
it and change the field 'Target Milestone' accordingly.

If you want to download a current version of OOo =>
http://download.openoffice.org/index.html
If you want to know more about the handling of fixed/verified issues =>
http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues
Comment 43 thorsten.ziehm 2009-07-20 15:35:24 UTC
Sorry, this issue was wrongly closed. It will be reopened automatically
and then set back to fixed/verified.
Comment 44 thorsten.ziehm 2009-07-20 15:39:54 UTC
Set to state 'fixed'.
Comment 45 thorsten.ziehm 2009-07-20 15:44:07 UTC
Set back to state 'verified/fixed'.

Again, sorry for the mass of mails.
Comment 46 thorsten.ziehm 2010-02-22 15:42:11 UTC
This issue is closed automatically. It should be fixed in a version which has
been available for longer than half a year (OOo 3.1). If you think this issue isn't
fixed in the current version (OOo 3.2), please reopen it. But then please pay
attention to the field 'target milestone'.
The closure was approved by the Release Status Meeting on 22 February 2010
and is based on the issue handling guideline for fixed/verified issues:
http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues