Apache OpenOffice (AOO) Bugzilla – Issue 58513
Problem with compound words with hyphens in Finnish text
Last modified: 2013-08-07 15:01:25 UTC
Compound words containing a hyphen are incorrectly considered as separate words by OpenOffice.org (m142 built by Pavel Janík and m138 built by myself).

Steps to reproduce:
1) Open a new, empty Writer document
2) From character properties, set language to Finnish
3) Type "auto" (without the quotes) into the document
4) Observe that in the document properties, statistics show "Number of Words: 1"
5) Replace "auto" with "kuorma-auto" in the document
6) Now statistics show "Number of Words: 2".

This is incorrect, because (at least) in Finnish "kuorma-auto" is a single word. This bug also breaks spellchecking with our (partially closed-source) spellchecker, because typing a word like "Kaakkois-Suomi" results in "Kaakkois" being marked as a spelling error (it would be an error if it was written without -Suomi) but the compound word is in fact correct.
Harri: A new breakiterator is needed for Finnish. (For example, the Hungarian breakiterator patterns already include both dash and n-dash as word characters: i18npool/source/breakiterator/data/dict_word_hu.txt) Laci
Created attachment 31915: Finnish breakiterator data (not yet ready for use)
The attached file is identical to the default dict_word.txt except that I have added [:name = HYPHEN-MINUS:] to $MidLetter. As I do not completely understand the syntax of this file, I thought that this is a safe and minimal change to make spellchecking work. After building OOo with this file added, "Kaakkois-Suomi" is no longer flagged as a spelling error, which is good. But word count is still wrong. It is also possible for the hyphen to be the first or the last letter of a word, as in "Kaakkois- ja Keski-Suomi". This still does not work, although I am not sure if our spellchecker would handle this correctly anyway. I do not know if it is correct to use n-dash in these cases; I have asked about this on dev@fi.openoffice.org, perhaps someone from there can comment on this issue.
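The difference between breaking at every hyphen and treating the hyphen as a MidLetter can be illustrated with a small Python sketch. This is only a simplified regex model, not the ICU rule engine that OOo uses; the names LETTER, DEFAULT_WORD, MIDLETTER_WORD and words() are made up for the illustration.

```python
import re

# Simplified model (assumption): a "letter" is any Unicode alphabetic
# character, written here as "word character that is not a digit or _".
LETTER = r"[^\W\d_]"

# Default-like behaviour: a hyphen always ends the word.
DEFAULT_WORD = re.compile(rf"{LETTER}+")

# MidLetter-like behaviour: a hyphen is kept only when it sits
# between two letters, never at the start or end of a word.
MIDLETTER_WORD = re.compile(rf"{LETTER}+(?:-{LETTER}+)*")


def words(pattern, text):
    """Return the list of words that the given pattern finds in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    for sample in ("Kaakkois-Suomi", "Kaakkois- ja Keski-Suomi"):
        print(sample, words(DEFAULT_WORD, sample), words(MIDLETTER_WORD, sample))
```

In this model "Kaakkois-Suomi" stays in one piece under the MidLetter-like pattern, while the trailing hyphen in "Kaakkois- ja Keski-Suomi" is still cut off from "Kaakkois", which matches the behaviour described above.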
Created attachment 32633: Finnish breakiterator data (second attempt)
According to a lot of people, n-dash is not a proper word character in Finnish so the default handling is fine for it. Attached second version of dict_word_fi.txt allows HYPHEN-MINUS to exist anywhere within a Finnish word but makes no other changes to breakiterator rules. This seems to be enough to fix our compound word handling. I have tested this myself, and hope that it is a safe fix to be added for 2.0.2. Word counting is still not fixed by this though, maybe a separate issue should be filed for that.

The actual difference between the default dict_word.txt and dict_word_fi.txt is the following:

--- dict_word.txt 2005-11-04 17:32:41.000000000 +0200
+++ dict_word_fi.txt 2005-12-10 15:11:39.000000000 +0200
@@ -24,7 +24,7 @@
 $Ideographic = [:Ideographic:];
 $Hangul = [:Script = HANGUL:];
-$ALetter = [\u0002 [:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:]
+$ALetter = [\u0002 [:Alphabetic:] [:name= COMMERCIAL AT:] [:name= HEBREW PUNCTUATION GERESH:] [:name = HYPHEN-MINUS:]
   - $Ideographic
   - $Katakana
   - $Hangul
I guess this issue should be moved to component l10n and type set to PATCH, but I do not have the required permissions to do that. Could someone please help here?
I have reassigned this issue for you.
Hi Thomas, please could you look into this issue and evaluate the patch? If someone else is closer to the break iterator subject, please feel free to forward. Thanks. Lutz.
TL->KHONG: Breakiterator issue. Please take over. Thanks!
dict_word is used for dictionary word break, edit_word is for cursor travelling, while count_word is for word count. Do you think we need to add dict_word_fi.txt, edit_word_fi.txt and count_word_fi.txt to handle dash in all cases?
At least dict_word_fi.txt and count_word_fi.txt would be needed. edit_word_fi.txt is a harder question. I made a test form for this (see http://www.hunspell-fi.org/ooo/tests/breakiterator.html ) and according to these tests MS Word, for some reason, does consider "-" as a word separator when editing text, but not during spellchecking or in word count. So maybe we want to do the same and not touch edit_word_fi.txt.

In the test form there are also some tests for words like "USA:ssa" (="in the USA") that are also used in Finnish. They are broken in a similar way in OOo, and the fix would be to add colon to MidLetter (in dict_word_fi.txt and count_word_fi.txt). Perhaps colon should be there by default? At least Table 2 in http://www.unicode.org/reports/tr29/#Word_Boundaries already lists colon in MidLetter.

I have not tested these additional changes. I will write a note after I manage to build OOo with them and verify that the behaviour is the same as in Word.
Adding [:name= HYPHEN-MINUS:] to $ALetter and [:name= COLON:] to $MidLetter in both dict_word_fi.txt and count_word_fi.txt seems to do the right thing in my test build (m163). With these changes I get the same (correct) behaviour as with MS Word.
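The combined effect of these two changes on word counting can be approximated with a short Python sketch. It is a regex model under simplifying assumptions, not the actual breakiterator: the hyphen is treated like any other letter (as in $ALetter), the colon is accepted only between two letter runs (as in $MidLetter), and the names ALETTER, FI_WORD and count_words() are hypothetical.

```python
import re

# Simplified model (assumption): any Unicode alphabetic character.
LETTER = r"[^\W\d_]"

# ALetter-like class: letters plus HYPHEN-MINUS, so a hyphen may
# appear at the start, in the middle or at the end of a word.
ALETTER = rf"(?:{LETTER}|-)"

# MidLetter-like colon: allowed only between two ALetter runs.
FI_WORD = re.compile(rf"{ALETTER}+(?::{ALETTER}+)*")


def count_words(text):
    """Count words in text under the simplified Finnish-like rules."""
    return len(FI_WORD.findall(text))


if __name__ == "__main__":
    for sample in ("auto", "kuorma-auto", "Kaakkois- ja Keski-Suomi", "USA:ssa"):
        print(sample, FI_WORD.findall(sample), count_words(sample))
```

One limitation of this sketch is that a hyphen standing alone between spaces would also be counted as a word; it says nothing about how the real rules treat that case.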
fixed in cws locales203.
ready for QA. re-open issue and reassign to oc@openoffice.org
reassign to oc@openoffice.org
reset resolution to FIXED
Retargeting to OOo2.0.3.
verified in internal build cws_locales203
closed because fix available in OpenOffice.org Developer Snapshot Build src680_m167