Issue 51772 - Quotes in Hebrew word breaking don't work during spellcheck
Summary: Quotes in Hebrew word breaking don't work during spellcheck
Status: CONFIRMED
Alias: None
Product: Writer
Classification: Application
Component: code
Version: OOo 2.0 Beta
Hardware: All All
Importance: P3 Trivial with 2 votes
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Duplicates: 55809
Depends on:
Blocks:
 
Reported: 2005-07-10 10:49 UTC by alan
Modified: 2013-08-07 14:44 UTC
CC List: 4 users

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Changes handling of RTL numstrings, and adjusts X coordinate for RTL in PaintBullet (2.10 KB, patch)
2005-12-01 20:18 UTC, alan
Changes script type of quote, geresh, gershayim to WEAK in Hebrew context (3.45 KB, patch)
2008-11-12 07:43 UTC, alan

Description alan 2005-07-10 10:49:32 UTC
At present, OOo considers a quotation mark to be the end of a word during spellcheck.
This should be changed. Here is background (provided by Jonathan Ben-Avraham):
-----------
The Hebrew writing system uses one double quotation mark as the penultimate
character of a lexical item in order to indicate that the lexical item is either
an acronym or a Hebrew number, rather than a normal word. The double quote mark
is always preceded and followed by a Hebrew character, never a whitespace
character, punctuation mark or single quote. The reader uses domain and
contextual knowledge in order to distinguish between acronyms and numbers. When
the character set allows distinct opening and closing double quote glyphs, then
Hebrew uses the closing (slanting from upper right to lower left) double
quotation mark.

The Hebrew writing system uses one single quote mark after (visually to the left
of) a Hebrew consonant as an accent mark to indicate that the consonant should
be pronounced in an alternative way (usually to indicate a foreign pronunciation
for a letter that does not exist in Hebrew), or to indicate a contraction. The
single quote can be after any character of a word, including in word final
position (followed by whitespace or a punctuation mark). Words that use the
single quote as either an accent mark or contraction indicator are not listed in
common Hebrew dictionaries.

In addition, Hebrew also uses double and single quotation marks in pairs to
indicate quotations in the same way that Western languages do.

The above explanations unfortunately reflect the way key mappings are set up
in Israel today for historical reasons, not the way things should be in an
ideal world. A real Unicode purist would use \u05F4 (HEBREW_GERSHAYIM)
instead of \u0022 (they look the same), and \u05F3 (HEBREW_GERESH) instead of
\u0027. The break iterator code in OOo should be fixed to deal with *both* the
common and the correct usages.

Hebrew words can be hyphenated between any two characters. There are no syllable
based hyphenation rules as in English. There is no Hebrew hyphen (yes, \u05BE
HEBREW_MAQAF is not a hyphen).
---------
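To make the rules above concrete, here is a minimal Python sketch of a word matcher that keeps these marks inside a Hebrew word. It is only an illustration, not OOo's break iterator: the names HEBREW_WORD and hebrew_words are made up, the letter range is limited to alef..tav (U+05D0..U+05EA), and both the keyboard forms (\u0022, \u0027) and the Unicode forms (\u05F4, \u05F3, plus the closing curly quotes) are accepted.

```python
import re

# Hebrew letters alef..tav; points and cantillation marks are ignored in this sketch.
HEB = "\u05D0-\u05EA"
# Double-quote forms: ASCII quote, closing curly double quote, HEBREW GERSHAYIM.
DOUBLE = "\"\u201D\u05F4"
# Single-quote forms: ASCII apostrophe, closing curly single quote, HEBREW GERESH.
SINGLE = "'\u2019\u05F3"

HEBREW_WORD = re.compile(
    # One or more letters, each optionally followed by a geresh-like mark
    # (this also covers the word-final position).
    f"(?:[{HEB}][{SINGLE}]?)+"
    # Optionally a gershayim-like mark as the penultimate character:
    # it must be followed by exactly one final Hebrew letter.
    f"(?:[{DOUBLE}][{HEB}](?![{HEB}]))?"
)

def hebrew_words(text):
    """Return the Hebrew lexical items found in text (hypothetical helper)."""
    return HEBREW_WORD.findall(text)

# An ordinary word followed by an acronym written with the ASCII double quote.
print(hebrew_words("\u05D3\u05D5\u05D7 \u05D3\u05D5\"\u05D7"))
```

Note that a word-final single quote cannot be told apart from a closing quotation mark by a pattern alone, so this sketch simply attaches it to the word.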
This is only an issue during spellchecking. When moving from word to word using
Ctrl/Right or Ctrl/Left, quotes are *not* treated as the end of a word. This is
correct. However, during spellchecking, the behavior is not correct.

There is more on this subject at:
http://l10n.openoffice.org/servlets/BrowseList?list=dev&by=thread&from=936644

and the patch submitted at:
http://www.openoffice.org/issues/show_bug.cgi?id=51661
Comment 1 michael.ruess 2005-07-11 14:08:43 UTC
Reassigned to SBA.
Comment 2 stefan.baltzer 2005-10-12 13:58:28 UTC
*** Issue 55809 has been marked as a duplicate of this issue. ***
Comment 3 stefan.baltzer 2005-10-12 14:08:20 UTC
SBA->FME: As discussed, yours. 
Note: The closed duplicate (issue 55809 "Script type change is always regarded
as a word boundary") has an attachment with several quote characters in Hebrew
words. 
It was a follow-up of issue 51661 that was fixed within break iterator.
Comment 4 stefan.baltzer 2005-10-12 14:20:13 UTC
SBA: Summary adjusted.
Comment 5 alan 2005-12-01 20:18:03 UTC
Created attachment 31966
Changes handling of RTL numstrings, and adjusts X coordinate for RTL in PaintBullet
Comment 6 alan 2005-12-01 20:20:22 UTC
Sorry, I attached the patch to the wrong issue (as the name of the patch indicates).
Comment 7 alan 2008-11-12 07:43:00 UTC
Created attachment 57916
Changes script type of quote, geresh, gershayim to WEAK in Hebrew context
Comment 8 alan 2008-11-12 07:45:54 UTC
I've attached a patch which changes the script type of double-quote, apostrophe,
geresh, gershayim to WEAK in Hebrew context, thus not breaking a Hebrew word at
those characters. See the discussion at
http://l10n.openoffice.org/servlets/BrowseList?list=dev&by=thread&from=2146651
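As a rough illustration of the idea (not the attached patch itself), the Python sketch below reclassifies quote-like characters as WEAK when they directly follow a Hebrew letter. The script type names mirror LATIN, COMPLEX and WEAK, but the classifier, the helper names and the "preceded by a Hebrew letter" test for Hebrew context are all assumptions made for this example.

```python
LATIN, COMPLEX, WEAK = "LATIN", "COMPLEX", "WEAK"

# Double quote, apostrophe, geresh, gershayim.
QUOTE_LIKE = {'"', "'", "\u05F3", "\u05F4"}

def is_hebrew_letter(ch):
    """True for alef..tav (U+05D0..U+05EA)."""
    return "\u05D0" <= ch <= "\u05EA"

def script_type(text, i):
    """Toy script classifier for text[i]; not OOo's real script tables."""
    ch = text[i]
    if is_hebrew_letter(ch):
        return COMPLEX
    if ch in QUOTE_LIKE and i > 0 and is_hebrew_letter(text[i - 1]):
        # Assumed meaning of "Hebrew context": the mark follows a Hebrew letter.
        return WEAK
    # Simplification: everything else is treated as LATIN in this sketch.
    return LATIN

acronym = "\u05D0\"\u05D0"  # alef, double quote, alef
print([script_type(acronym, i) for i in range(len(acronym))])
# prints ['COMPLEX', 'WEAK', 'COMPLEX']
```

Because a WEAK character does not start a new script run, code that splits words at script changes would no longer break the word at the quote.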
Comment 9 frank.meies 2008-11-13 07:57:51 UTC
fme->khong: This only affects break iterator code. Please take over.
Comment 10 karl.hong 2008-12-15 22:43:41 UTC
Karl->fme: The break iterator seems to be doing the right thing. The following
program shows that, for a Hebrew dictionary word, the break iterator considers
the double quote to be part of the word:

Sub Main

' Create the break iterator service
xBI = createUnoService("com.sun.star.i18n.BreakIterator")

' Use the Hebrew locale
Dim aLocale as new com.sun.star.lang.Locale
aLocale.Language = "he"

nWordType = 2	' WordType::DICTIONARY_WORD

' Test string: alef, double quote, alef
aTxt = CHR$(&H5d0) + CHR$(&H22) + CHR$(&H5d0)

' Get the boundary of the word at position 0
aBoundary = xBI.getWordBoundary( aTxt, 0, aLocale, nWordType, true )
print aBoundary.StartPos, aBoundary.EndPos

End Sub

It prints (0,3). Something must be going wrong in Writer when it sends the word to the spellchecker.
Comment 11 frank.meies 2009-01-05 14:08:18 UTC
@ayaniger: Well, for issue 16354 I already implemented some code that changes
the script type obtained from i18n to COMPLEX in case the direction of the
character run is RTL, see porlay.cxx. For a couple of tasks (e.g., spell
checking, word count etc.) the SwScanner::NextWord method is used. This method
contains some code that clips the words at script type boundaries. Now in my
opinion the problem is that the SwScanner::NextWord function does not use the
ScriptInfo data structure (which contains the 'changed' script type) but rather
directly uses the break iterator to find the script boundaries. What do you think?
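The effect described here can be sketched in a few lines of Python. This is a toy model, not the SwScanner code: clip_word is a hypothetical helper, and the raw classification of the double quote as LATIN is an assumption. The word boundary (0, 3) is the one the break iterator reports for alef, double quote, alef.

```python
LATIN, COMPLEX = "LATIN", "COMPLEX"

def clip_word(scripts, start, end):
    """Shorten the word [start, end) so it stops at the first script type change."""
    first = scripts[start]
    pos = start + 1
    while pos < end and scripts[pos] == first:
        pos += 1
    return start, pos

# Per-character script types for alef, double quote, alef.
raw_scripts = [COMPLEX, LATIN, COMPLEX]         # assumed result when asking the break iterator directly
adjusted_scripts = [COMPLEX, COMPLEX, COMPLEX]  # after the RTL adjustment to COMPLEX

print(clip_word(raw_scripts, 0, 3))       # prints (0, 1)
print(clip_word(adjusted_scripts, 0, 3))  # prints (0, 3)
```

With the raw script types the word is cut at the quote, so only its first letter would reach the spellchecker; with the adjusted types the full word survives.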
Comment 12 alan 2009-01-06 11:44:26 UTC
@fme: Yes, that seems correct.
Comment 13 Mathias_Bauer 2009-05-07 15:45:42 UTC
Just for bookkeeping: is this patch still being worked on? Or shall we reject it and
set the issue type to "DEFECT"?
Comment 14 alan 2009-05-07 16:48:10 UTC
If Frank is not working on this, I will try to work on it next week.
Comment 15 Mathias_Bauer 2009-05-25 16:23:30 UTC
Setting target 3.x for the time being
Comment 16 Rob Weir 2013-03-11 15:04:50 UTC
I'm adding this comment to all open issues with Issue Type == PATCH.  We have 220 such issues, many of them quite old.  I apologize for that.  

We need your help in prioritizing which patches should be integrated into our next release, Apache OpenOffice 4.0.

If you have submitted a patch and think it is applicable for AOO 4.0, please respond with a comment to let us know.

On the other hand, if the patch is no longer relevant, please let us know that as well.

If you have any general questions or want to discuss this further, please send a note to our dev mailing list:  dev@openoffice.apache.org

Thanks!

-Rob