Apache OpenOffice (AOO) Bugzilla – Issue 51772
Quotes in Hebrew wordbreaking don't work during spellcheck
Last modified: 2013-08-07 14:44:35 UTC
At present, OOo treats a quotation mark as the end of a word during spellcheck. This should be changed. Here is background (provided by Jonathan Ben-Avraham):

-----------

The Hebrew writing system uses one double quotation mark as the penultimate character of a lexical item in order to indicate that the lexical item is either an acronym or a Hebrew number, rather than a normal word. The double quote mark is always preceded and followed by a Hebrew character, never a whitespace character, punctuation mark or single quote. The reader uses domain and contextual knowledge in order to distinguish between acronyms and numbers. When the character set allows distinct opening and closing double quote glyphs, Hebrew uses the closing (slanting from upper right to lower left) double quotation mark.

The Hebrew writing system uses one single quote mark after (visually to the left of) a Hebrew consonant as an accent mark to indicate that the consonant should be pronounced in an alternative way (usually to indicate a foreign pronunciation for a letter that does not exist in Hebrew), or to indicate a contraction. The single quote can come after any character of a word, including in word-final position (followed by whitespace or a punctuation mark). Words that use the single quote as either an accent mark or contraction indicator are not listed in common Hebrew dictionaries.

In addition, Hebrew also uses double and single quotation marks in pairs to indicate quotations in the same way that Western languages do.

The above explanations unfortunately reflect the way the key mappings are set up in Israel today, for historical reasons; this is not the way things would be in an ideal world. A real Unicode purist would use \u05F4 (HEBREW_GERSHAYIM) instead of \u0022 (they look the same), and \u05F3 (HEBREW_GERESH) instead of \u0027. The break iterator code in OOo should be fixed to deal with *both* the common and the correct usages.
Hebrew words can be hyphenated between any two characters. There are no syllable-based hyphenation rules as in English. There is no Hebrew hyphen (yes, \u05BE HEBREW_MAQAF is not a hyphen).

---------

This is only an issue during spellchecking. When moving from word to word using Ctrl/Right or Ctrl/Left, quotes are *not* treated as the end of a word. This is correct. However, during spellchecking, the behavior is not correct.

There is more on this subject at: http://l10n.openoffice.org/servlets/BrowseList?list=dev&by=thread&from=936644 and the patch submitted at: http://www.openoffice.org/issues/show_bug.cgi?id=51661
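To make the quote rules in the background above concrete, here is a minimal Python sketch of a word extractor that keeps word-internal quote marks. It is only an illustration and not OOo's break iterator: the names `HEB`, `GERESH`, `GERSHAYIM`, `WORD` and `hebrew_words` are hypothetical, only the letters U+05D0..U+05EA are treated as Hebrew (vowel points are ignored), and both the keyboard forms (\u0022, \u0027) and the Unicode forms (\u05F4, \u05F3) are accepted.

```python
import re

# Assumption: only the letters alef..tav (U+05D0..U+05EA) count as Hebrew here.
HEB = "\u05D0-\u05EA"
# Single-quote-like marks: ASCII apostrophe and HEBREW PUNCTUATION GERESH.
GERESH = "'\u05F3"
# Double-quote-like marks: ASCII double quote and HEBREW PUNCTUATION GERSHAYIM.
GERSHAYIM = "\"\u05F4"

# A word is a sequence of Hebrew letters where
#  - a geresh-like mark may follow any letter, including the last one, and
#  - a gershayim-like mark is kept only when it sits between two letters.
WORD = re.compile(
    rf"(?:[{HEB}][{GERESH}]?)(?:[{GERSHAYIM}]?[{HEB}][{GERESH}]?)*"
)

def hebrew_words(text):
    """Return the Hebrew words in text, keeping word-internal quote marks."""
    return WORD.findall(text)

if __name__ == "__main__":
    acronym = "\u05E6\u05D4\"\u05DC"             # three letters with " before the last
    quoted = "\"\u05E9\u05DC\u05D5\u05DD\""      # an ordinary word inside a "..." pair
    print(hebrew_words(acronym + " " + quoted))
```

The sketch is deliberately lenient: it accepts a double quote between any two Hebrew letters rather than only in penultimate position. It also cannot tell a word-final geresh from a closing single quotation mark, which is the same ambiguity a real word breaker would have to resolve from context.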
Reassigned to SBA.
*** Issue 55809 has been marked as a duplicate of this issue. ***
SBA->FME: As discussed, yours. Note: The closed duplicate (issue 55809 "Script type change is always regarded as a word boundary") has an attachment with several quote characters in Hebrew words. It was a follow-up of issue 51661 that was fixed within the break iterator.
SBA: Summary adjusted.
Created attachment 31966 [details] Changes handling of RTL numstrings, and adjusts X coordinate for RTL in PaintBullet
Sorry, I attached the patch to the wrong issue (as the name of the patch indicates).
Created attachment 57916 [details] Changes script type of quote, geresh, gershayim to WEAK in Hebrew context
I've attached a patch which changes the script type of double-quote, apostrophe, geresh, gershayim to WEAK in Hebrew context, thus not breaking a Hebrew word at those characters. See the discussion at http://l10n.openoffice.org/servlets/BrowseList?list=dev&by=thread&from=2146651
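As a rough illustration of the idea (not the attached patch itself), the following Python sketch classifies characters into script types and splits text into script runs. The classifier is a toy: the names `raw_script`, `contextual_script` and `script_runs` are hypothetical, and "Hebrew context" is assumed here to mean "immediately preceded by a Hebrew letter", which may differ from what the patch does.

```python
LATIN, COMPLEX, WEAK = "LATIN", "COMPLEX", "WEAK"
# Double quote, apostrophe, geresh (U+05F3), gershayim (U+05F4).
QUOTE_LIKE = {'"', "'", "\u05F3", "\u05F4"}

def is_hebrew_letter(ch):
    return "\u05D0" <= ch <= "\u05EA"

def raw_script(text, i):
    """Toy classifier: Hebrew letters are COMPLEX, whitespace is WEAK, the rest LATIN."""
    ch = text[i]
    if is_hebrew_letter(ch):
        return COMPLEX
    if ch.isspace():
        return WEAK
    return LATIN

def contextual_script(text, i):
    """Like raw_script, but quote-like marks after a Hebrew letter become WEAK."""
    if text[i] in QUOTE_LIKE and i > 0 and is_hebrew_letter(text[i - 1]):
        return WEAK
    return raw_script(text, i)

def script_runs(text, classify):
    """Split text into script runs; WEAK characters never start a new run."""
    runs, start, current = [], 0, None
    for i in range(len(text)):
        t = classify(text, i)
        if t == WEAK:
            continue
        if current is not None and t != current:
            runs.append(text[start:i])
            start = i
        current = t
    if text:
        runs.append(text[start:])
    return runs

if __name__ == "__main__":
    word = "\u05E6\u05D4\"\u05DC"
    print(len(script_runs(word, raw_script)))         # split at the quote
    print(len(script_runs(word, contextual_script)))  # kept as one run
```

With the raw classifier the double quote forms its own run and splits the Hebrew word; once it is treated as WEAK it joins the surrounding run, so no script boundary falls inside the word.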
fme->khong: This only affects break iterator code. Please take over.
Karl->fme: The break iterator seems to be doing the right thing. The following program shows that, for Hebrew, the DICTIONARY_WORD break iterator considers the double quote to be part of the word:

    Sub Main
        xBI = createUnoService("com.sun.star.i18n.BreakIterator")
        Dim aLocale As New com.sun.star.lang.Locale
        aLocale.Language = "he"
        nWordType = 2 ' WordType::DICTIONARY_WORD
        ' Hebrew alef, double quote, Hebrew alef
        aTxt = Chr$(&H5D0) + Chr$(&H22) + Chr$(&H5D0)
        aBoundary = xBI.getWordBoundary(aTxt, 0, aLocale, nWordType, True)
        Print aBoundary.StartPos, aBoundary.EndPos
    End Sub

It prints (0,3). Something must be wrong in how Writer sends the word to the spellchecker.
@ayaniger: Well, for issue 16354 I already implemented some code that changes the script type obtained from i18n to COMPLEX in case the direction of the character run is RTL, see porlay.cxx. For a couple of tasks (e.g., spell checking, word count etc.) the SwScanner::NextWord method is used. This method contains some code that clips the words at script type boundaries. Now in my opinion the problem is that the SwScanner::NextWord function does not use the ScriptInfo data structure (which contains the 'changed' script type) but rather directly uses the break iterator to find the script boundaries. What do you think?
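The clipping effect described here can be shown with a small Python sketch. It is not the SwScanner::NextWord code: `clip_to_script_run` is a hypothetical helper, and the per-character script lists are hand-written stand-ins for what the break iterator and the ScriptInfo structure would report for the three-character test string (alef, double quote, alef) with word boundary (0,3).

```python
def clip_to_script_run(start, end, scripts):
    """Clip the word [start, end) at the first change of script type.

    scripts holds one script type per character of the text.
    """
    first = scripts[start]
    for i in range(start + 1, end):
        if scripts[i] != first:
            return start, i
    return start, end

if __name__ == "__main__":
    # Assumed raw script types: the double quote is not COMPLEX.
    raw = ["COMPLEX", "LATIN", "COMPLEX"]
    # Assumed adjusted types: the whole RTL run was changed to COMPLEX.
    adjusted = ["COMPLEX", "COMPLEX", "COMPLEX"]
    print(clip_to_script_run(0, 3, raw))
    print(clip_to_script_run(0, 3, adjusted))
```

Under these assumptions the raw script types cut the word down to its first letter, while the adjusted types leave the boundary found by the break iterator untouched.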
@fme: Yes, that seems correct.
Just for bookkeeping: is this patch still being worked on? Or shall we reject it and set the issue type to "DEFECT"?
If Frank is not working on this, I will try to work on it next week.
Setting target 3.x for the time being
I'm adding this comment to all open issues with Issue Type == PATCH. We have 220 such issues, many of them quite old. I apologize for that. We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0. If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know. On the other hand, if the patch is no longer relevant, please let us know that as well. If you have any general questions or want to discuss this further, please send a note to our dev mailing list: dev@openoffice.apache.org Thanks! -Rob