Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing
|Summary:||breakiterator should be locale dependant|
|Status:||CLOSED FIXED||QA Contact:||issues@l10n <issues>|
|Priority:||P3||CC:||issues, khendricks, openofficeissuezilla|
|Issue Type:||ENHANCEMENT||Latest Confirmation in:||---|
Description jesus 2003-04-15 12:11:36 UTC
Word terminator characters in i18n/breakiterator.cxx should be locale-dependent, because not all languages use the same ones. For example, in some Latin languages the character '-' is part of the word.
Comment 1 Dieter.Loeschky 2003-04-22 12:32:47 UTC
DL->FME: Would you please take over?
Comment 2 frank.meies 2003-04-22 12:38:27 UTC
FME->Karl: Looks like yours.
Comment 3 karl.hong 2003-04-29 18:24:25 UTC
Karl: We need more information about the issue. In which language should '-' be considered part of the word? Please give some examples so that we can understand the issue.
Comment 4 karl.hong 2003-04-29 18:30:05 UTC
Karl: send issue back to submitter for more information.
Comment 5 jesus 2003-05-01 19:58:02 UTC
Hi all, In i18n/breakiterator.cxx the delimiter characters are defined in a static char for all languages. They should be locale-dependent, because they are not the same for all the languages in the world. For example, in Catalan '-' can't be a delimiter, because in some cases we append the subject of the action, and even the object of the action, to the verb. For example: "mirar-se" ("look at oneself"), "comprar-vos", "donem-nos les mans", "aneu-vos-en!" ("You, go away from here!"). With the current constant delimiters we have to change that file to make the spellchecker work properly with Catalan documents. I am sure that some other languages, once implemented, will have the same problem with this or another character. I think the best way for a truly multi-language application is to keep these settings in the locale data file for each language, since they are locale-dependent. Regards,
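A minimal sketch of the proposal above, assuming a per-locale delimiter table. The names `DEFAULT_DELIMITERS`, `LOCALE_DELIMITERS` and `split_words` are hypothetical and are not taken from i18n/breakiterator.cxx; the sketch only illustrates why a single static delimiter set splits Catalan words incorrectly.

```python
# Hypothetical delimiter tables, for illustration only.
DEFAULT_DELIMITERS = " \t\n.,;:!?\"()-"
LOCALE_DELIMITERS = {
    # Assumption: Catalan uses the default set minus '-'.
    "ca": DEFAULT_DELIMITERS.replace("-", ""),
}


def split_words(text, locale):
    """Split text on the delimiter characters configured for the locale."""
    delimiters = LOCALE_DELIMITERS.get(locale, DEFAULT_DELIMITERS)
    words, current = [], []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the word collected so far, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("donem-nos les mans", "ca"))
print(split_words("donem-nos les mans", "en"))
```

With the Catalan entry, "donem-nos" reaches the spellchecker as one word; with the default set it is split at the hyphen.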
Comment 6 karl.hong 2003-05-01 20:26:59 UTC
Hi, The i18n/breakiterator.cxx you referred to is an old implementation for OO 1.0.*. From 1.1 on, we are using i18npool/source/breakiterator, which has its own implementation for Asian languages and calls the ICU breakiterator for other languages. We could implement a different breakiterator for each language. Actually we have a special one for Catalan. Here is the requirement we were given for implementing the Catalan word breakiterator for the spellchecker: "In Catalan the word "cel·la", where "·" is character 0xB7, should be recognized as one word. This should work that way for spelling and hyphenation." I added this character 0xb7 as midLetter in i18npool/source/breakiterator/data/dict_word_ca.h as an ICU breakiterator rule for Catalan. If you think '-' is another midLetter case, you could add it in the same file.
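The midLetter idea can be illustrated with a small sketch: a midLetter character belongs to a word only when it sits between two letters. This is a simplified Python model, not the ICU rule syntax and not the contents of dict_word_ca.h; `MID_LETTERS` and `find_words` are hypothetical names, and '-' is included only as the addition proposed in this issue.

```python
# Hypothetical per-locale midLetter sets (illustrative, not from dict_word_ca.h).
MID_LETTERS = {
    "ca": {"\u00b7", "-"},  # 0xB7 middle dot, plus the proposed 0x2D hyphen
}


def find_words(text, locale):
    """Return the words in text, keeping midLetter characters flanked by letters."""
    mid = MID_LETTERS.get(locale, set())
    result, start = [], None
    for i, ch in enumerate(text):
        # A midLetter counts as part of the word only between two letters.
        inside = ch.isalpha() or (
            ch in mid
            and 0 < i < len(text) - 1
            and text[i - 1].isalpha()
            and text[i + 1].isalpha()
        )
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            result.append(text[start:i])
            start = None
    if start is not None:
        result.append(text[start:])
    return result


print(find_words("cel\u00b7la", "ca"))
print(find_words("cel\u00b7la", "en"))
```

Unlike simply removing a character from a delimiter list, this rule does not treat a leading, trailing, or stand-alone '-' as part of a word.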
Comment 7 karl.hong 2003-05-15 19:22:30 UTC
OK, I will add '-' as MidLetter in the Catalan dictionary word breakiterator in the next release.
Comment 8 jesus 2003-05-16 21:56:34 UTC
Hi Karl, I'm sorry, but I've been really busy these last few weeks :-( I will implement and test this code myself for the Catalan version of 1.1. I hope this code can then be easily integrated into the 2.0 branch. There are also some minor bugs in the localedata for Catalan. I'll post some patches next week, and maybe also a patch for the 1.1 breakiterator. Thanks for your time and regards,
Comment 10 jesus 2003-06-03 11:31:20 UTC
This patch fixes some localedata info and adds 0x2d '-' as another midLetter case.
Comment 11 karl.hong 2003-08-09 01:57:00 UTC
I have taken the patch provided and integrated it in CWS i18n08.
Comment 12 karl.hong 2003-09-10 00:06:07 UTC
Verified in CWS i18n08.
Comment 13 ooo 2003-09-10 10:24:37 UTC
Reassign to QA.
Comment 14 oc 2003-09-22 15:50:12 UTC
Comment 15 oc 2003-09-22 15:50:44 UTC
Comment 16 stefan.baltzer 2003-11-06 13:17:41 UTC
SBA: Verified in CWS i18n08. I tried the given examples mirar-se, comprar-vos, donem-nos and aneu-vos-en. All are regarded as correct when I activate a Catalan spellchecker. When adding one wrong letter (e.g. "mirar-sez"), the whole word (including the "-") gets marked as wrong. Set to verified.
Comment 17 khendricks 2003-12-09 14:48:30 UTC
KNH -> hkong Hi, I am glad to see this is fixed. There is a related issue for Italian: they would like to make ' (0x27) a break point so that words like l'Epido bell'canto are properly separated into their two parts. According to the it_IT project team, this would allow the size of their spellchecking dictionary to be literally cut in half. I see the separate maps in i18npool/source/breakiterator/data for ca, ko, ja, zh and have read the discussion in this issue, but I am not certain I understand how to make a change for a language without a map like dict_word_ca.h. So would you please provide some hints on how someone would go about adding the ' as a break point specific to Italian? Would we have to create a dict_word_it.h file? How would we integrate such a change into OOo? Thanks, Kevin
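The behaviour requested for Italian can be sketched as follows. This only illustrates the desired word boundaries; it does not show how such a rule would be added to i18npool, which is the open question above. The function name `italian_aware_words` is hypothetical, and keeping the apostrophe attached to the elided first part is an assumption.

```python
import re

# One or more letters (any script), excluding digits and underscore.
LETTERS = r"[^\W\d_]+"


def italian_aware_words(text, locale):
    """Split text into words, breaking after ' (0x27) for Italian only."""
    if locale == "it":
        # Assumption: the apostrophe ends the word and stays with the prefix.
        pattern = LETTERS + "'?"
    else:
        # Elsewhere an apostrophe between letters stays inside the word.
        pattern = LETTERS + "(?:'" + LETTERS + ")*"
    return re.findall(pattern, text)


print(italian_aware_words("l'Epido bell'canto", "it"))
print(italian_aware_words("l'Epido bell'canto", "en"))
```

With the break in place, a dictionary needs entries for the elided forms and the base words separately, instead of one entry for every combination.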
Comment 18 jack.warchold 2004-08-04 14:37:06 UTC
seen good in OOo680_m49