Issue 51772

Summary: Quotes in Hebrew word breaking don't work during spellcheck
Product: Writer
Reporter: alan
Component: code
Assignee: AOO issues mailing list <issues>
Status: CONFIRMED ---
QA Contact:
Severity: Trivial
Priority: P3
CC: elisko, issues, Mathias_Bauer, yba
Version: OOo 2.0 Beta
Target Milestone: ---
Hardware: All
OS: All
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---
Description Flags
Changes handling of RTL numstrings, and adjusts X coordinate for RTL in PaintBullet
Changes script type of quote, geresh, gershayim to WEAK in Hebrew context none

Description alan 2005-07-10 10:49:32 UTC
At present, OOo treats a quotation mark as the end of a word during spellcheck.
This should be changed. Here is background (provided by Jonathan Ben-Avraham):
The Hebrew writing system uses one double quotation mark as the penultimate
character of a lexical item in order to indicate that the lexical item is either
an acronym or a Hebrew number, rather than a normal word. The double quote mark
is always preceded and followed by a Hebrew character, never a whitespace
character, punctuation mark or single quote. The reader uses domain and
contextual knowledge in order to distinguish between acronyms and numbers. When
the character set allows distinct opening and closing double quote glyphs, then
Hebrew uses the closing (slanting from upper right to lower left) double
quotation mark.
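
As a rough illustration of the rule above, the following Python sketch tests whether a double mark sits between two Hebrew letters. The helper names and the exact set of accepted characters are assumptions made for this example, not OOo code, and the sketch does not enforce the "penultimate character" condition.

```python
def is_hebrew_letter(ch: str) -> bool:
    """True for the Hebrew letters alef..tav (U+05D0..U+05EA)."""
    return "\u05d0" <= ch <= "\u05ea"


# Assumed set of double marks: ASCII quote, right (closing) double
# quotation mark, and HEBREW PUNCTUATION GERSHAYIM.
DOUBLE_MARKS = {"\u0022", "\u201d", "\u05f4"}


def is_in_word_double_mark(text: str, i: int) -> bool:
    """Hypothetical helper: is text[i] a double mark with a Hebrew letter
    immediately before and immediately after it?"""
    if text[i] not in DOUBLE_MARKS:
        return False
    if i == 0 or i + 1 >= len(text):
        return False
    return is_hebrew_letter(text[i - 1]) and is_hebrew_letter(text[i + 1])


# An acronym: two letters, a double quote, one final letter.
acronym = "\u05e6\u05d4\"\u05dc"
print(is_in_word_double_mark(acronym, 2))
```

A quotation mark that opens or closes a quoted phrase has whitespace or punctuation on one side, so the same check rejects it.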

The Hebrew writing system uses one single quote mark after (visually to the left
of) a Hebrew consonant as an accent mark to indicate that the consonant should
be pronounced in an alternative way (usually to indicate a foreign pronunciation
for a letter that does not exist in Hebrew), or to indicate a contraction. The
single quote can be after any character of a word, including in word final
position (followed by whitespace or a punctuation mark). Words that use the
single quote as either an accent mark or contraction indicator are not listed in
common Hebrew dictionaries.
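
The single-quote rule could be sketched in Python as below. The function names are hypothetical, and the check is deliberately local: it only looks at the preceding character, so it cannot tell a word-final geresh from a closing single quotation mark that happens to follow a Hebrew letter.

```python
def is_hebrew_letter(ch: str) -> bool:
    """True for the Hebrew letters alef..tav (U+05D0..U+05EA)."""
    return "\u05d0" <= ch <= "\u05ea"


# Assumed set of single marks: ASCII apostrophe and HEBREW PUNCTUATION GERESH.
SINGLE_MARKS = {"\u0027", "\u05f3"}


def is_in_word_single_mark(text: str, i: int) -> bool:
    """Hypothetical helper: is text[i] a single mark directly after a
    Hebrew letter? Word-final position is allowed."""
    if text[i] not in SINGLE_MARKS or i == 0:
        return False
    return is_hebrew_letter(text[i - 1])


# A word with one mark in the middle and one in word-final position.
word = "\u05d2'\u05d5\u05e8\u05d2'"
print([i for i in range(len(word)) if is_in_word_single_mark(word, i)])
```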

In addition, Hebrew also uses double and single quotation marks in pairs to
indicate quotations in the same way that Western languages do.

The above explanations unfortunately reflect the way key mappings are set up in
Israel today for historical reasons, not the way things should be in an ideal
world. A real Unicode purist would use \u05F4 (HEBREW_GERSHAYIM) instead of
\u0022 (they look the same), and \u05F3 (HEBREW_GERESH) instead of
\u0027. The break iterator code in OOo should be fixed to deal with *both* the
common and the correct usages.
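
One way to handle both usages is to map the common keyboard characters to the dedicated Hebrew code points before comparing words. The Python sketch below is only an illustration under that assumption; `normalize_hebrew_marks` is a made-up name, and the break iterator does not necessarily work this way.

```python
def _is_hebrew_letter(ch: str) -> bool:
    """True for the Hebrew letters alef..tav (U+05D0..U+05EA)."""
    return "\u05d0" <= ch <= "\u05ea"


def normalize_hebrew_marks(text: str) -> str:
    """Hypothetical helper: replace U+0022 between two Hebrew letters with
    U+05F4 (gershayim) and U+0027 after a Hebrew letter with U+05F3 (geresh).
    All other characters are left unchanged."""
    out = []
    for i, ch in enumerate(text):
        prev_is_heb = i > 0 and _is_hebrew_letter(text[i - 1])
        next_is_heb = i + 1 < len(text) and _is_hebrew_letter(text[i + 1])
        if ch == "\u0022" and prev_is_heb and next_is_heb:
            out.append("\u05f4")
        elif ch == "\u0027" and prev_is_heb:
            out.append("\u05f3")
        else:
            out.append(ch)
    return "".join(out)


common = "\u05e6\u05d4\"\u05dc"         # typed with the ASCII double quote
correct = "\u05e6\u05d4\u05f4\u05dc"    # typed with HEBREW_GERSHAYIM
print(normalize_hebrew_marks(common) == correct)
```

After normalization both spellings compare equal, while paired quotation marks around Latin text or at the start of a Hebrew phrase are left alone.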

Hebrew words can be hyphenated between any two characters. There are no syllable
based hyphenation rules as in English. There is no Hebrew hyphen (yes, \u05BE
HEBREW_MAQAF is not a hyphen).
This is only an issue during spellchecking. When moving from word to word using
Ctrl/Right or Ctrl/Left, quotes are *not* treated as the end of a word. This is
correct. However, during spellchecking, the behavior is not correct.

There is more on this subject at:

and the patch submitted at:
Comment 1 michael.ruess 2005-07-11 14:08:43 UTC
Reassigned to SBA.
Comment 2 stefan.baltzer 2005-10-12 13:58:28 UTC
*** Issue 55809 has been marked as a duplicate of this issue. ***
Comment 3 stefan.baltzer 2005-10-12 14:08:20 UTC
SBA->FME: As discussed, yours. 
Note: The closed duplicate (issue 55809 "Script type change is always regarded
as a word boundary") has an attachment with several quote characters in Hebrew.
It was a follow-up of issue 51661 that was fixed within break iterator.
Comment 4 stefan.baltzer 2005-10-12 14:20:13 UTC
SBA: Summary adjusted.
Comment 5 alan 2005-12-01 20:18:03 UTC
Created attachment 31966 [details]
Changes handling of RTL numstrings, and adjusts X coordinate for RTL in PaintBullet
Comment 6 alan 2005-12-01 20:20:22 UTC
Sorry, I attached the patch to the wrong issue (as the name of the patch indicates).
Comment 7 alan 2008-11-12 07:43:00 UTC
Created attachment 57916 [details]
Changes script type of quote, geresh, gershayim to WEAK in Hebrew context
Comment 8 alan 2008-11-12 07:45:54 UTC
I've attached a patch which changes the script type of double-quote, apostrophe,
geresh, gershayim to WEAK in Hebrew context, thus not breaking a Hebrew word at
those characters. See the discussion at
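
To show why a WEAK script type avoids the break, here is a small Python sketch. It is not the attached patch: the script type labels, the context rule (mark directly after a Hebrew letter) and the function names are assumptions chosen for the illustration.

```python
QUOTE_LIKE = {"\u0022", "\u0027", "\u05f3", "\u05f4"}


def is_hebrew_letter(ch: str) -> bool:
    """True for the Hebrew letters alef..tav (U+05D0..U+05EA)."""
    return "\u05d0" <= ch <= "\u05ea"


def script_type(text: str, i: int) -> str:
    """Assumed classification: a quote-like mark right after a Hebrew letter
    is WEAK, other Hebrew-block characters are COMPLEX, the rest LATIN."""
    ch = text[i]
    if ch in QUOTE_LIKE and i > 0 and is_hebrew_letter(text[i - 1]):
        return "WEAK"
    if "\u0590" <= ch <= "\u05ff":
        return "COMPLEX"
    return "LATIN"


def script_runs(text: str) -> list:
    """Group characters into (type, start, end) runs. A WEAK character
    takes the type of the run before it, so it never starts a new run."""
    runs = []
    for i in range(len(text)):
        t = script_type(text, i)
        if t == "WEAK" and runs:
            t = runs[-1][0]
        if runs and runs[-1][0] == t:
            runs[-1] = (t, runs[-1][1], i + 1)
        else:
            runs.append((t, i, i + 1))
    return runs


print(script_runs("\u05e6\u05d4\"\u05dc"))
```

With the mark treated as WEAK, the whole acronym forms a single COMPLEX run, so code that splits words at script changes has nothing to split on.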
Comment 9 frank.meies 2008-11-13 07:57:51 UTC
fme->khong: This only affects break iterator code. Please take over.
Comment 10 karl.hong 2008-12-15 22:43:41 UTC
Karl->fme: the break iterator seems to be doing the right thing. The following
program shows that, for a Hebrew locale and the dictionary word type, the break
iterator considers the double quote to be part of the word:

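Sub Main
    ' Assumption: the standard UNO break iterator service and locale struct
    xBI = createUnoService("com.sun.star.i18n.BreakIterator")

    Dim aLocale As New com.sun.star.lang.Locale
    aLocale.Language = "he"

    nWordType = 2   ' WordType::DICTIONARY_WORD

    ' Hebrew alef, ASCII double quote, Hebrew alef
    aTxt = Chr$(&H5D0) + Chr$(&H22) + Chr$(&H5D0)

    ' Word boundary around position 0, preferring the word that follows
    aBoundary = xBI.getWordBoundary(aTxt, 0, aLocale, nWordType, True)
    Print aBoundary.startPos, aBoundary.endPos
End Sub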

It prints (0,3). Something must be wrong in how Writer sends the word to the spellchecker.
Comment 11 frank.meies 2009-01-05 14:08:18 UTC
@ayaniger: Well, for issue 16354 I already implemented some code that changes
the script type obtained from i18n to COMPLEX in case the direction of the
character run is RTL, see porlay.cxx. For a couple of tasks (e.g., spell
checking, word count etc.) the SwScanner::NextWord method is used. This method
contains some code that clips the words at script type boundaries. Now in my
opinion the problem is that the SwScanner::NextWord function does not use the
ScriptInfo data structure (which contains the 'changed' script type) but rather
directly uses the break iterator to find the script boundaries. What do you think?
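
The effect described here can be sketched in Python. This is not the SwScanner code: `clip_word` is a hypothetical stand-in for the clipping step, and the two script type lists are illustrative values, one as the raw types might look and one with the adjusted ('changed') types.

```python
def clip_word(start: int, end: int, script_types: list) -> tuple:
    """Hypothetical clipping step: cut the word [start, end) at the first
    position whose script type differs from the type at `start`."""
    first = script_types[start]
    for i in range(start + 1, end):
        if script_types[i] != first:
            return (start, i)
    return (start, end)


# Four characters: two Hebrew letters, a double quote, one Hebrew letter.
# The word boundary itself is assumed to cover the whole string, (0, 4).
raw_types = ["COMPLEX", "COMPLEX", "LATIN", "COMPLEX"]          # assumed raw types
adjusted_types = ["COMPLEX", "COMPLEX", "COMPLEX", "COMPLEX"]   # assumed adjusted types

print(clip_word(0, 4, raw_types))       # word is cut before the quote
print(clip_word(0, 4, adjusted_types))  # word stays whole
```

Under these assumptions the word boundary is correct in both cases, and only the choice of script type source decides whether the spellchecker receives a truncated word.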
Comment 12 alan 2009-01-06 11:44:26 UTC
@fme: Yes, that seems correct.
Comment 13 Mathias_Bauer 2009-05-07 15:45:42 UTC
Just for bookkeeping - is this patch still being worked on? Or shall we reject it and
set the issue type to "DEFECT"?
Comment 14 alan 2009-05-07 16:48:10 UTC
If Frank is not working on this, I will try to work on it next week.
Comment 15 Mathias_Bauer 2009-05-25 16:23:30 UTC
Setting target 3.x for the time being
Comment 16 Rob Weir 2013-03-11 15:04:50 UTC
I'm adding this comment to all open issues with Issue Type == PATCH.  We have 220 such issues, many of them quite old.  I apologize for that.  

We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0.

If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know.

On the other hand, if the patch is no longer relevant, please let us know that as well.

If you have any general questions or want to discuss this further, please send a note to our dev mailing list: