Issue 58513 - Problem with compound words with hyphens in Finnish text
Status: CLOSED FIXED
Alias: None
Product: Internationalization
Classification: Code
Component: code
Version: current
Hardware: All Linux, all
Importance: P3 Trivial
Target Milestone: ---
Assignee: oc
QA Contact: issues@l10n
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-11-27 11:30 UTC by hatapitk
Modified: 2013-08-07 15:01 UTC

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Finnish breakiterator data (not yet ready for use) (5.18 KB, text/plain)
2005-11-29 21:15 UTC, hatapitk
Finnish breakiterator data (second attempt) (5.18 KB, text/plain)
2005-12-21 20:39 UTC, hatapitk

Description hatapitk 2005-11-27 11:30:07 UTC
OpenOffice.org (m142 built by Pavel Janík and m138 built by myself) incorrectly 
treats compound words containing a hyphen as separate words.  
  
Steps to reproduce: 
1) Open a new, empty Writer document 
2) From character properties, set language to Finnish 
3) Type "auto" (without the quotes) into the document 
4) Observe that in the document properties, statistics show "Number of Words: 1"  
5) Replace "auto" with "kuorma-auto" in the document 
6) Now statistics show "Number of Words: 2". This is incorrect, because (at 
least) in Finnish "kuorma-auto" is a single word.  
  
This bug also breaks spellchecking with our (partially closed-source) 
spellchecker: typing a word like "Kaakkois-Suomi" results in "Kaakkois" being 
marked as a spelling error. "Kaakkois" on its own would indeed be an error, 
but the compound word is in fact correct.
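
The reported word count can be mimicked with a minimal sketch. It assumes a simplified tokenizer in which only letters are word characters and everything else, including HYPHEN-MINUS, separates words. The name `count_words` is hypothetical, and this is not the OpenOffice.org breakiterator code.

```python
import re

# Assumption: a word is a maximal run of letters; every other character,
# including HYPHEN-MINUS, acts as a separator. This only mimics the
# reported behaviour and is not the real breakiterator.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def count_words(text):
    """Count maximal runs of letters in text."""
    return len(LETTER_RUN.findall(text))


print(count_words("auto"))         # 1
print(count_words("kuorma-auto"))  # 2, although Finnish treats it as one word
```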
Comment 1 nemeth.lacko 2005-11-28 11:32:50 UTC
Harri: Finnish needs its own breakiterator data. For example, the Hungarian
breakiterator patterns already treat both dash and n-dash as word
characters: i18npool/source/breakiterator/data/dict_word_hu.txt

Laci

Comment 2 hatapitk 2005-11-29 21:15:14 UTC
Created attachment 31915
Finnish breakiterator data (not yet ready for use)
Comment 3 hatapitk 2005-11-29 21:29:25 UTC
The attached file is identical to the default dict_word.txt except that I have 
added [:name = HYPHEN-MINUS:] to $MidLetter. As I do not completely understand 
the syntax of this file, I thought that this is a safe and minimal change to 
make spellchecking work. After building OOo with this file added, 
"Kaakkois-Suomi" is no longer flagged as a spelling error, which is good. But 
word count is still wrong. 
It is also possible for the hyphen to be the first or the last letter of a 
word, as in "Kaakkois- ja Keski-Suomi". This still does not work, although I 
am not sure if our spellchecker would handle this correctly anyway. I do not 
know if it is correct to use n-dash in these cases; I have asked about this on 
dev@fi.openoffice.org, perhaps someone from there can comment on this issue. 
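
Why a mid-letter character fixes "Kaakkois-Suomi" but not "Kaakkois- ja Keski-Suomi" can be sketched as follows. The sketch assumes a simplified rule of the form ALetter (MidLetter ALetter)*, in which a mid-letter character only joins two runs of letters. The regular expression and the name `dict_words` are illustrative assumptions, not the contents of dict_word.txt.

```python
import re

# Assumption: simplified model of a rule of the form
#   ALetter+ (MidLetter ALetter+)*
# A mid-letter character may join two letter runs but cannot start or
# end a word.
A_LETTER = r"[^\W\d_]"
MID_LETTER = r"-"  # HYPHEN-MINUS placed in the mid-letter class
WORD = re.compile(rf"{A_LETTER}+(?:{MID_LETTER}{A_LETTER}+)*")


def dict_words(text):
    """Return the words found by the simplified mid-letter rule."""
    return WORD.findall(text)


print(dict_words("Kaakkois-Suomi"))
# ['Kaakkois-Suomi']
print(dict_words("Kaakkois- ja Keski-Suomi"))
# ['Kaakkois', 'ja', 'Keski-Suomi']  (the trailing hyphen is dropped)
```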
Comment 4 hatapitk 2005-12-21 20:39:19 UTC
Created attachment 32633
Finnish breakiterator data (second attempt)
Comment 5 hatapitk 2005-12-21 20:55:45 UTC
According to a lot of people, n-dash is not a proper word character in Finnish, 
so the default handling is fine for it. The attached second version of 
dict_word_fi.txt allows HYPHEN-MINUS to appear anywhere within a Finnish word 
but makes no other changes to the breakiterator rules. This seems to be enough 
to fix our compound word handling. I have tested this myself, and hope that it 
is a safe fix to be added for 2.0.2. 
Word counting is still not fixed by this though, maybe a separate issue should 
be filed for that. 
 
The actual difference between the default dict_word.txt and dict_word_fi.txt 
is the following: 
--- dict_word.txt       2005-11-04 17:32:41.000000000 +0200 
+++ dict_word_fi.txt    2005-12-10 15:11:39.000000000 +0200 
@@ -24,7 +24,7 @@ 
 $Ideographic = [:Ideographic:]; 
 $Hangul = [:Script = HANGUL:]; 
 
-$ALetter   = [\u0002 [:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:] 
+$ALetter   = [\u0002 [:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:] [:name = HYPHEN-MINUS:] 
                            - $Ideographic 
                            - $Katakana 
                            - $Hangul 
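
The effect of moving HYPHEN-MINUS into the letter class, as in the diff above, can be sketched like this. It is a simplified model under the assumption that a word is any run of letter-class characters. The regular expression and the name `dict_words` are illustrative and are not taken from dict_word_fi.txt.

```python
import re

# Assumption: simplified model in which HYPHEN-MINUS belongs to the
# letter class itself, so it may appear anywhere in a word, including
# at the start or the end.
WORD = re.compile(r"(?:[^\W\d_]|-)+")


def dict_words(text):
    """Return the words found when the hyphen counts as a letter."""
    return WORD.findall(text)


print(dict_words("Kaakkois-Suomi"))
# ['Kaakkois-Suomi']
print(dict_words("Kaakkois- ja Keski-Suomi"))
# ['Kaakkois-', 'ja', 'Keski-Suomi']
```

In this sketch a free-standing hyphen also forms a word of its own. Whether the real rules behave the same way is not verified here.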
 
Comment 6 hatapitk 2006-02-19 20:28:29 UTC
I guess this issue should be moved to component l10n and type set to PATCH, 
but I do not have the required permissions to do that. Could someone please 
help here?
Comment 7 lars 2006-02-19 22:12:22 UTC
I have reassigned this issue for you.
Comment 8 lutz.hoeger 2006-04-07 07:22:20 UTC
Hi Thomas, please could you look into this issue and evaluate the patch? If
someone else is closer to the break iterator subject, please feel free to
forward. Thanks. Lutz.
Comment 9 thomas.lange 2006-04-10 11:24:42 UTC
TL->KHONG: Breakiterator issue. Please take over. Thanks!
Comment 10 karl.hong 2006-04-14 22:03:52 UTC
dict_word is used for dictionary word break, edit_word is for cursor travelling,
while count_word is for word count.

Do you think we need to add dict_word_fi.txt, edit_word_fi.txt and
count_word_fi.txt to handle dash in all cases?
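
The split into three rule types can be pictured as a small lookup. This is a hypothetical illustration only: it assumes the file naming seen in this issue ("<type>_<lang>.txt" for a locale-specific file, "<type>.txt" for the default) and does not reproduce how OpenOffice.org actually locates its rule files. The names `RULE_PURPOSE` and `rule_file` are made up for the sketch.

```python
# Assumption: purposes as described in the comment above; the lookup
# logic is a hypothetical illustration, not OpenOffice.org code.
RULE_PURPOSE = {
    "dict_word": "dictionary word break",
    "edit_word": "cursor travelling",
    "count_word": "word count",
}


def rule_file(rule_type, lang, available):
    """Pick the locale-specific rule file if present, else the default."""
    if rule_type not in RULE_PURPOSE:
        raise ValueError(f"unknown rule type: {rule_type}")
    localized = f"{rule_type}_{lang}.txt"
    return localized if localized in available else f"{rule_type}.txt"


files = {"dict_word_fi.txt", "count_word_fi.txt", "dict_word_hu.txt"}
print(rule_file("dict_word", "fi", files))  # dict_word_fi.txt
print(rule_file("edit_word", "fi", files))  # edit_word.txt (default)
```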
Comment 11 hatapitk 2006-04-17 11:36:39 UTC
At least dict_word_fi.txt and count_word_fi.txt would be needed. 
edit_word_fi.txt is a harder question. I did a test form for this (see 
http://www.hunspell-fi.org/ooo/tests/breakiterator.html ) and according to 
these tests MS Word, for some reason, does consider "-" as a word separator 
when editing text, but not during spellchecking or in word count. So maybe we 
want to do the same and not touch edit_word_fi.txt.

In the test form there are also some tests for words like "USA:ssa" (="in the 
USA") that are also used in Finnish. They are broken in a similar way in 
OOo, and the fix would be to add colon to MidLetter (in dict_word_fi.txt and 
count_word_fi.txt). Perhaps colon should be there by default? At least in 
http://www.unicode.org/reports/tr29/#Word_Boundaries Table 2 already lists 
colon in MidLetter.

I have not tested these additional changes yet. I will write a note once I 
have built OOo with them and verified that the behaviour is the same as in 
Word.
Comment 12 hatapitk 2006-04-18 18:00:21 UTC
Adding [:name= HYPHEN-MINUS:] to $ALetter and [:name= COLON:] to $MidLetter in 
both dict_word_fi.txt and count_word_fi.txt seems to do the right thing in my 
test build (m163). With these changes I get the same (correct) behaviour as 
with MS Word.
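
The combined change can be sketched with the same kind of simplified model: HYPHEN-MINUS counts as a letter, while COLON only joins two runs of letters. The regular expression and the name `words` are assumptions for illustration and are not the actual rules in dict_word_fi.txt or count_word_fi.txt.

```python
import re

# Assumption: simplified model of the combined change.
#   letter class: letters plus HYPHEN-MINUS
#   mid-letter:   COLON, which may only sit between two letter runs
A_LETTER = r"(?:[^\W\d_]|-)"
MID_LETTER = r":"
WORD = re.compile(rf"{A_LETTER}+(?:{MID_LETTER}{A_LETTER}+)*")


def words(text):
    """Return the words found by the simplified combined rule."""
    return WORD.findall(text)


print(words("kuorma-auto"))        # ['kuorma-auto']
print(words("USA:ssa"))            # ['USA:ssa']
print(words("auto: kuorma-auto"))  # ['auto', 'kuorma-auto']
print(len(words("kuorma-auto")))   # 1
```

A colon followed by a space is not absorbed into the preceding word, because in this model a mid-letter character needs a letter on both sides.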
Comment 13 karl.hong 2006-05-02 19:20:00 UTC
fixed in cws locales203.
Comment 14 karl.hong 2006-05-02 19:20:39 UTC
ready for QA.

re-open issue and reassign to oc@openoffice.org
Comment 15 karl.hong 2006-05-02 19:20:47 UTC
reassign to oc@openoffice.org
Comment 16 karl.hong 2006-05-02 19:20:52 UTC
reset resolution to FIXED
Comment 17 ooo 2006-05-03 11:15:01 UTC
Retargeting to OOo2.0.3.
Comment 18 oc 2006-05-03 12:48:38 UTC
verified in internal build cws_locales203
Comment 19 oc 2006-05-03 12:49:07 UTC
.
Comment 20 oc 2006-05-11 10:43:38 UTC
closed because fix available in OpenOffice.org Developer Snapshot Build src680_m167