Issue 13451

Summary: breakiterator should be locale dependent
Product: Internationalization Reporter: jesus
Component: code Assignee: stefan.baltzer
Status: CLOSED FIXED QA Contact: issues@l10n <issues>
Severity: Trivial    
Priority: P3 CC: issues, khendricks, openofficeissuezilla
Version: OOo 1.0.3   
Target Milestone: ---   
Hardware: PC   
OS: All   
Issue Type: ENHANCEMENT Latest Confirmation in: ---
Developer Difficulty: ---
Attachments:
Description Flags
Catalan locale fixes none

Description jesus 2003-04-15 12:11:36 UTC
Word terminator characters in i18n/breakiterator.cxx should be locale dependent,
because not all languages use the same ones.
For example, in some Latin languages the character '-' is part of the word.
Comment 1 Dieter.Loeschky 2003-04-22 12:32:47 UTC
DL->FME: Would you please takeover?
Comment 2 frank.meies 2003-04-22 12:38:27 UTC
FME->Karl: Looks like yours.
Comment 3 karl.hong 2003-04-29 18:24:25 UTC
Karl: we need more information about the issue. In which language should 
'-' be considered part of the word? Please give some examples to help us 
understand the issue.
Comment 4 karl.hong 2003-04-29 18:30:05 UTC
Karl: send issue back to submitter for more information.
Comment 5 jesus 2003-05-01 19:58:02 UTC
Hi all,

In i18n/breakiterator.cxx character delimiters are defined in a static
char for all languages and they should be locale dependent, because
they are not the same for all the languages in the world.

For example, in Catalan '-' can't be a delimiter because in some cases
we append the subject of the action to the verb and even the object of
the action. For example:

"mirar-se" (Look at oneself)
"comprar-vos" 
"donem-nos les mans"
"aneu-vos-en!" ("You, go away from here!")

With these constant delimiters we have to change that file in
order to make the spellchecker work properly with Catalan documents.
But I am sure that some other languages, once implemented, will have
the same problem with this or another character.

I think the best way for a real multi-language application is to have
these settings in the locale data file for each language, as they are
locale dependent.

Regards,
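
To illustrate the proposal above, here is a minimal sketch of a delimiter set that is looked up per locale instead of being one static constant. All names (kDefaultDelimiters, delimitersFor, splitWords) and the delimiter strings themselves are hypothetical and are not taken from i18n/breakiterator.cxx; the sketch only handles single-byte characters.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical default delimiter set shared by all locales without an
// override. Note that it contains '-'.
static const std::string kDefaultDelimiters = " \t\n.,;:!?\"()-";

// Return the delimiter set for a locale, falling back to the default.
const std::string& delimitersFor(const std::string& locale) {
    // Assumed override: Catalan keeps '-' inside words ("aneu-vos-en").
    static const std::map<std::string, std::string> overrides = {
        {"ca", " \t\n.,;:!?\"()"},
    };
    auto it = overrides.find(locale);
    return it != overrides.end() ? it->second : kDefaultDelimiters;
}

// Split text into words using the delimiter set of the given locale.
std::vector<std::string> splitWords(const std::string& text,
                                    const std::string& locale) {
    const std::string& delims = delimitersFor(locale);
    std::vector<std::string> words;
    std::string current;
    for (char c : text) {
        if (delims.find(c) != std::string::npos) {
            // A delimiter ends the current word, if there is one.
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

With this table, splitWords("aneu-vos-en!", "ca") yields the single word "aneu-vos-en", while any locale without an override still splits it at each '-'.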
Comment 6 karl.hong 2003-05-01 20:26:59 UTC
Hi,

The i18n/breakiterator.cxx you referred to is an old implementation for 
OO 1.0.*. From 1.1 on, we are using i18npool/source/breakiterator, which 
has its own implementation for Asian languages and calls the ICU 
breakiterator for other languages.

We could implement a different breakiterator for each language. 
Actually we have a special one for Catalan. Here is the requirement for 
us to implement a Catalan word breakiterator for the spellchecker.

"In Catalan the word "cel·la" where "·" is character 0xB7 should be 
recognized as one word. This should work that way for spelling and 
hyphenation."

I added this character 0xb7 as midLetter in 
i18npool/source/breakiterator/data/dict_word_ca.h as ICU 
breakiterator rule for Catalan.

If you think '-' is another case of midLetter, you could add it in 
the same file.
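
As a rough illustration of what a midLetter character means for word breaking: it stays inside a word only when it sits between two letters, and acts as a break everywhere else. The sketch below is a simplified stand-in, not the ICU rule syntax and not the contents of dict_word_ca.h; the function names and the reduced letter test are assumptions made for the example.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Simplified letter test for this sketch: ASCII letters plus the Latin-1
// letter range. A real implementation would use Unicode properties.
bool isLetter(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
           (c >= 0x00C0 && c <= 0x00FF && c != 0x00D7 && c != 0x00F7);
}

// Split text into words. A character from midLetters is kept inside a
// word only if a word is in progress and the next character is a letter.
std::vector<std::u32string> splitWithMidLetters(
        const std::u32string& text, const std::set<char32_t>& midLetters) {
    std::vector<std::u32string> words;
    std::u32string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        char32_t c = text[i];
        bool joins = midLetters.count(c) != 0 && !current.empty() &&
                     i + 1 < text.size() && isLetter(text[i + 1]);
        if (isLetter(c) || joins) {
            current += c;
        } else if (!current.empty()) {
            // Any other character ends the current word.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

With the set {U+00B7, '-'}, "cel·la" and "mirar-se" each come out as one word, while a '-' standing alone between spaces still separates words.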
Comment 7 karl.hong 2003-05-15 19:22:30 UTC
OK, I will add '-' as MidLetter in the Catalan dictionary word 
breakiterator in the next release.
Comment 8 jesus 2003-05-16 21:56:34 UTC
Hi Karl,

I'm sorry but I've been really busy these last weeks :-( 
I will implement and test this code myself for the Catalan version of 
1.1. I hope this code could then be easily integrated into the 2.0 
branch.

There are also some minor bugs in the localedata for Catalan. I'll 
post some patches next week and maybe also a patch for the 1.1 
breakiterator.

Thanks for your time and regards,
Comment 9 jesus 2003-06-03 11:29:26 UTC
Created attachment 6613 [details]
Catalan locale fixes
Comment 10 jesus 2003-06-03 11:31:20 UTC
This patch fixes some localedata info and adds 0x2d '-' as another
midLetter case.
Comment 11 karl.hong 2003-08-09 01:57:00 UTC
I have taken the patch provided and integrated it in CWS i18n08.
Comment 12 karl.hong 2003-09-10 00:06:07 UTC
Verified in CWS i18n08.
Comment 13 ooo 2003-09-10 10:24:37 UTC
Reassign to QA.
Comment 14 oc 2003-09-22 15:50:12 UTC
Adjusting owner
Comment 15 oc 2003-09-22 15:50:44 UTC
adjusting resolution
Comment 16 stefan.baltzer 2003-11-06 13:17:41 UTC
SBA: Verified in CWS i18n08. 
I tried the given examples mirar-se, comprar-vos, donem-nos and
aneu-vos-en. All are regarded as correct when I activate a Catalan
spellchecker. When adding one wrong letter (e.g. "mirar-sez"), the
whole word (including the "-") gets marked as wrong. 
Set to verified.
Comment 17 khendricks 2003-12-09 14:48:30 UTC
KNH -> hkong 
 
Hi, 
 
I am glad to see this is fixed.  There is a related issue for Italian in that they would like to make ' 
(0x27) a break point so that words like l'Epido  bell'canto are properly separated into their two parts. 
This would allow their spellchecking dictionary size to literally be cut in half according to their it_IT 
project team. 
 
I see the separate maps in i18npool/source/breakiterator/data 
for ca, ko, ja, zh and read the discussion in this issue but I am not certain I understand how to 
make a change for a language without a map like dict_word_ca.h 
 
So would you please provide some hints on how someone would go about adding the ' as a break 
point specific to Italian. 
 
Would we have to create a dict_word_it.h file? 
 
How would we integrate such a change into OOo? 
 
Thanks, 
 
Kevin 
 
Comment 18 jack.warchold 2004-08-04 14:37:06 UTC
seen good in OOo680_m49