Apache OpenOffice (AOO) Bugzilla – Issue 13451
breakiterator should be locale dependant
Last modified: 2013-08-07 15:00:08 UTC
Word terminator characters in i18n/breakiterator.cxx should be locale dependent, because not all languages use the same ones. For example, in some Latin languages the character '-' is part of the word.
DL->FME: Would you please take over?
FME->Karl: Looks like yours.
Karl: We need more information about the issue. In which language should '-' be considered part of the word? Please give some examples so that we can understand the issue.
Karl: send issue back to submitter for more information.
Hi all, In i18n/breakiterator.cxx the delimiter characters are defined in a static char shared by all languages. They should be locale dependent, because they are not the same for every language in the world. For example, in Catalan '-' cannot be a delimiter, because in some cases we append the subject of the action, and even the object of the action, to the verb. For example: "mirar-se" (look at oneself), "comprar-vos", "donem-nos les mans", "aneu-vos-en!" ("You, go away from here!"). With the current constant delimiters we have to change that file to make the spellchecker work properly with Catalan documents. I am sure that other languages, once implemented, will have the same problem with this or another character. I think the best approach for a truly multi-language application is to keep these settings in the locale data file for each language, as they are locale dependent. Regards,
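The request above can be illustrated with a minimal sketch, assuming a per-locale delimiter table. The names DEFAULT_DELIMITERS, LOCALE_DELIMITERS and split_words are hypothetical and are not taken from breakiterator.cxx, whose actual delimiter constant is not quoted in this issue.

```python
# Hypothetical default delimiter set; the real constant in
# i18n/breakiterator.cxx is not shown in this issue.
DEFAULT_DELIMITERS = set(" \t\n.,;:!?\"()-")

# Hypothetical per-locale overrides: Catalan keeps '-' inside words.
LOCALE_DELIMITERS = {
    "ca": DEFAULT_DELIMITERS - {"-"},
}


def split_words(text, locale):
    """Split text into words using the delimiter set for the given locale."""
    delimiters = LOCALE_DELIMITERS.get(locale, DEFAULT_DELIMITERS)
    words, current = [], []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the word collected so far, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("aneu-vos-en!", "ca"))
print(split_words("aneu-vos-en!", "en"))
```

With the Catalan entry the whole form "aneu-vos-en" reaches the spellchecker as one word; with the shared default it is cut at every '-'.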
Hi, The i18n/breakiterator.cxx you referred to is an old implementation for OO 1.0.*. Since 1.1 we use i18npool/source/breakiterator, which has its own implementation for Asian languages and calls the ICU breakiterator for other languages. We can implement a different breakiterator for each language, and we already have a special one for Catalan. Here is the requirement we had for the Catalan word breakiterator for the spellchecker: "In Catalan the word "cel·la", where "·" is character 0xB7, should be recognized as one word. This should work that way for spelling and hyphenation." I added character 0xB7 as a midLetter in i18npool/source/breakiterator/data/dict_word_ca.h, which holds the ICU breakiterator rule for Catalan. If you think '-' is another midLetter case, you could add it in the same file.
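A rough sketch of what a midLetter character means for word breaking, assuming the usual reading that such a character joins a word only when it sits between two letters. This is plain Python, not the ICU rule syntax used in dict_word_ca.h; MID_LETTERS and find_words are hypothetical names, and listing '-' for Catalan is the proposal under discussion, not an existing rule.

```python
# Hypothetical table of midLetter characters per locale.
# 0xB7 is the middle dot used in Catalan "cel·la".
MID_LETTERS = {
    "ca": {"\u00b7", "-"},
}


def find_words(text, locale):
    """Return the words in text; a midLetter joins only letter + letter."""
    mid = MID_LETTERS.get(locale, set())
    result, current = [], ""
    for i, ch in enumerate(text):
        # A midLetter counts as part of the word only when a word is open
        # and the next character is a letter.
        is_mid = (
            ch in mid
            and current != ""
            and i + 1 < len(text)
            and text[i + 1].isalpha()
        )
        if ch.isalpha() or is_mid:
            current += ch
        else:
            if current:
                result.append(current)
            current = ""
    if current:
        result.append(current)
    return result


print(find_words("cel\u00b7la", "ca"))
print(find_words("cel\u00b7la", "en"))
```

Unlike simply removing a character from a delimiter list, this keeps a trailing or leading '-' (as in "mirar- se") outside the word.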
OK, I will add '-' as a MidLetter in the Catalan dictionary word breakiterator in the next release.
Hi Karl, I'm sorry, but I've been really busy these last few weeks :-( I will implement and test this code myself for the Catalan version of 1.1. I hope this code can then be easily integrated into the 2.0 branch. There are also some minor bugs in the localedata for Catalan. I'll post some patches next week, and maybe also a patch for the 1.1 breakiterator. Thanks for your time and regards,
Created attachment 6613 [details] Catalan locale fixes
This patch fixes some localedata info and adds 0x2d '-' as another midLetter case.
I have taken the patch provided and integrated it in CWS i18n08.
Verified in CWS i18n08.
Reassign to QA.
Adjusting owner
adjusting resolution
SBA: Verified in CWS i18n08. I tried the given examples mirar-se, comprar-vos, donem-nos and aneu-vos-en. All are regarded as correct when I activate a Catalan spellchecker. When adding one wrong letter (e.g. "mirar-sez"), the whole word (including the "-") gets marked as wrong. Set to verified.
KNH -> hkong Hi, I am glad to see this is fixed. There is a related issue for Italian: they would like to make ' (0x27) a break point so that words like l'Epido bell'canto are properly separated into their two parts. According to the it_IT project team, this would allow the size of their spellchecking dictionary to be cut literally in half. I see the separate maps in i18npool/source/breakiterator/data for ca, ko, ja and zh, and I have read the discussion in this issue, but I am not certain I understand how to make a change for a language without a map like dict_word_ca.h. So would you please provide some hints on how someone would go about adding ' as a break point specific to Italian? Would we have to create a dict_word_it.h file? How would we integrate such a change into OOo? Thanks, Kevin
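To make the Italian request concrete, here is a minimal sketch of the desired behaviour, assuming the apostrophe is dropped at the break (the comment does not say whether it should stay attached to the first part). EXTRA_BREAKS and split_with_locale_breaks are hypothetical names; this does not show how a dict_word_it.h map would be written or integrated.

```python
# Hypothetical table of extra break characters per locale.
# 0x27 is the apostrophe the it_IT team would like to break on.
EXTRA_BREAKS = {
    "it": {"\x27"},
}


def split_with_locale_breaks(text, locale):
    """Split on whitespace, then on the locale's extra break characters."""
    breaks = EXTRA_BREAKS.get(locale, set())
    words = []
    for token in text.split():
        current = ""
        for ch in token:
            if ch in breaks:
                # Assumption: the break character itself is discarded.
                if current:
                    words.append(current)
                current = ""
            else:
                current += ch
        if current:
            words.append(current)
    return words


print(split_with_locale_breaks("l'Epido bell'canto", "it"))
print(split_with_locale_breaks("l'Epido bell'canto", "en"))
```

With the Italian entry each elided form is checked as two separate dictionary words, which is the effect the dictionary-size argument relies on.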
seen good in OOo680_m49