Issue 99796 - [HE] Words with double quote marks shall not get split for Hebrew spellchecking
Alias: None
Product: General
Classification: Code
Component: spell checking
Version: 3.3.0 or older (OOo)
Hardware: All All
Importance: P3 Trivial with 8 votes
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
Depends on:
Blocks: 25832
Reported: 2009-03-02 20:40 UTC by stefan.baltzer
Modified: 2013-02-24 20:43 UTC
CC List: 6 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---

picture of the problem (3.73 KB, image/png)
2010-08-17 10:29 UTC, eliadtsai

Description stefan.baltzer 2009-03-02 20:40:19 UTC
This is a spin-off of issue 31232 "[HE] Words with double quote marks shall not
get split at line end in Hebrew text". That one is OK in OOo 3.0.1.
Comment 1 stefan.baltzer 2009-03-02 20:48:43 UTC
SBA->TL/HDU: Words being split when spell checked: To be seen in attachment of
issue 31232 (HE spell checker extension needed).
A break iterator issue?
Where (else?) to treat double quotes differently?
Which double quotes, which ones not?
Comment 2 kaplanlior 2010-08-14 16:58:17 UTC
I tested the attachment from #31232, and the problem still occurs in 3.2.1 (OOO320m19).
Comment 3 kaplanlior 2010-08-14 17:02:06 UTC
kaplan->SBA: double quotes in Hebrew are used to make initials or acronyms.
Comment 4 kaplanlior 2010-08-14 19:03:06 UTC
#103402 has a very similar problem; I think the two should be fixed together
(probably the same code). Note that this is not the same problem, just a similar one.
Comment 5 thomas.lange 2010-08-16 10:44:16 UTC
tl->sba/kaplan: most likely a breakiterator issue (word boundaries for spell
checking are defined by it).

tl->kaplan: Just for the books and to avoid misunderstandings:
- by double quote you mean the 0x0022 character (ASCII double quote, decimal 34)? Or do you
have other typographical quotes in mind as well? If the latter, please list the
Unicode code points of all of them.
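For reference, a minimal sketch of how the code points of the candidate characters discussed in this issue could be listed. The CANDIDATES list and the describe() helper are illustrative names, not part of any OOo code; the lookup relies only on Python's standard unicodedata module.

```python
import unicodedata

# Candidate characters discussed in this issue (illustrative list):
# ASCII double quote, ASCII single quote, period, Hebrew geresh, Hebrew gershayim.
CANDIDATES = ['"', "'", ".", "\u05F3", "\u05F4"]


def describe(ch: str) -> str:
    """Return 'U+XXXX (decimal N) NAME' for a single character."""
    return f"U+{ord(ch):04X} (decimal {ord(ch)}) {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(describe(ch))
```

Printing both the hexadecimal and the decimal value helps avoid mixing up the two notations when code points are quoted in comments.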

Taking over issue for the time being. Setting target to 3.4.
Comment 6 eliadtsai 2010-08-17 10:29:00 UTC
Created attachment 71142
picture of the problem
Comment 7 eliadtsai 2010-08-17 10:36:18 UTC
the problematic symbols are:
1. period 0x002E
2. single quote 0x0027
3. double quotes 0x0022

I have attached an image that describes the problem. 
I don't think that there are any other symbols that can be used in the Hebrew
Comment 8 thomas.lange 2010-08-17 11:04:06 UTC
tl->eliadtsai: One more question since it makes a difference in how to handle
things in the breakiterator:

In the example I see that the . is used within a word and at the end of the word
as well. How about this for the single and double quote? Are they only part of
the word if used within a word, or would quotes at start and end need to be part
of the word as well?

Comment 9 hennerdrewes 2010-08-17 11:05:25 UTC
@tl, @eliadtsai: Hebrew has special Unicode code points for quote characters:
0x05F3 (equivalent to single quote) and 0x05F4 (equivalent to double quote).

IMHO they should be treated in the same way as their ASCII equivalents. 

- double quotes may be used to mark abbreviations. In that case they will be
*enclosed* by Hebrew characters (no space character on either side, as shown in
the screenshot)

- single quotes may be used to modify certain Hebrew letters or to mark
abbreviations. As opposed to the double quote case, this single quote may
also be placed as the last character of the word / abbreviation (this case is not
shown in eliadtsai's screenshot).

So it should be very easy to determine the usage of a double quote character. If
it is enclosed by Hebrew letters on both sides, it should be considered part of
the word. If there is a space or punctuation character on either side, it is
probably used as a quotation mark.

- The usage of single quotes is more complicated: 
When it appears in the middle of a word, it should be considered part of the
word (same as in the previous case).
When it appears at the end of a word, it could be part of the word /
abbreviation, or it could be used as a quotation mark.
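The rules described in this comment can be sketched roughly as follows. This is only an illustration of the proposed logic, not the OOo breakiterator implementation; the function names, the returned labels, and the restriction to the letter range U+05D0..U+05EA (no vowel points or other marks) are assumptions made for the example.

```python
# Quote characters named in this issue (ASCII and Hebrew-specific forms).
DOUBLE_QUOTES = {'"', "\u05F4"}  # quotation mark, gershayim
SINGLE_QUOTES = {"'", "\u05F3"}  # apostrophe, geresh


def is_hebrew_letter(ch: str) -> bool:
    """Assumption: only the base letters alef..tav count as Hebrew letters."""
    return "\u05D0" <= ch <= "\u05EA"


def quote_role(text: str, i: int) -> str:
    """Classify the quote at text[i] as 'word', 'ambiguous' or 'punctuation'."""
    ch = text[i]
    if ch not in DOUBLE_QUOTES and ch not in SINGLE_QUOTES:
        raise ValueError("text[i] is not a quote character")
    before = i > 0 and is_hebrew_letter(text[i - 1])
    after = i + 1 < len(text) and is_hebrew_letter(text[i + 1])
    if before and after:
        # Enclosed by Hebrew letters on both sides: part of the word.
        return "word"
    if ch in SINGLE_QUOTES and before:
        # Trailing single quote: abbreviation marker or closing quotation mark.
        return "ambiguous"
    return "punctuation"
```

For example, quote_role("\u05EA\"\u05D6", 1) yields "word" for the acronym case, while a single quote directly after a final Hebrew letter yields "ambiguous", which matches the open question about word-final single quotes.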

Comment 10 nyh 2010-11-02 12:53:58 UTC
Hi, I think this is an extremely important issue to solve for Hebrew writers,
because as it stands, every acronym, abbreviation, and foreign word with the
"J" sound is marked as a mistake, with no way for the user to accept it.

eliadtsai: please note that the period (.) is *not* one of the problematic
symbols: only the single and double quotes (and the Unicode characters "geresh"
and "gershaim") are used in these cases. In Hebrew it is customary to write
acronyms with gershaim (double quotes), e.g., ת"ז, not with dots ת.ז. Creating
acronyms with dots is a common practice in English, but NOT in Hebrew.
Comment 11 Martin Hollmichel 2011-03-16 11:56:10 UTC
Set target to 3.x; not relevant for the 3.4 release.