Apache OpenOffice (AOO) Bugzilla – Issue 99796
[HE] Words with double quote marks shall not get split for Hebrew spellchecking
Last modified: 2013-02-24 20:43:23 UTC
This is a spin-off of issue 31232 "[HE] Words with double quote marks shall not get split at line end in Hebrew text". That one is OK in OOo 3.0.1.
SBA->TL/HDU: Words are being split when spell checked: see the attachment of issue 31232 (HE spell checker extension needed). A break iterator issue? Where (else) should double quotes be treated differently? Which double quotes, and which ones not?
I tested the attachment from #31232, and the problem still occurs in OpenOffice.org 3.2.1 (OOO320m19).
kaplan->SBA: double quotes in Hebrew are used to make initials or acronyms.
#103402 has a very similar problem. I think the two should be fixed together (probably the same code). Note that this is not the same problem, just a similar one.
tl->sba/kaplan: Most likely a breakiterator issue (word boundaries for spell checking are defined by it). tl->kaplan: Just for the record and to avoid misunderstandings: by double quote, do you mean the 0x0022 character (ASCII double quote)? Or do you have other typographical quotes in mind as well? If the latter, please list the Unicode code points of all of them. Taking over the issue for the time being. Setting target to 3.4.
Created attachment 71142: picture of the problem
The problematic symbols are: 1. period 0x002E 2. single quote 0x0027 3. double quote 0x0022. I have attached an image that describes the problem. I don't think there are any other symbols that can be used in the Hebrew language.
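For reference, a minimal Python sketch for double-checking the code points of the three ASCII characters listed here. The helper name `describe` is only illustrative. It reports each value in both hexadecimal and decimal, since the two notations are easy to mix up (decimal 34 is 0x22).

```python
import unicodedata

# The three ASCII characters named in this comment.
CANDIDATES = [".", "'", '"']


def describe(ch: str) -> str:
    """Return the code point of ch in hex and decimal, plus its Unicode name."""
    return f"U+{ord(ch):04X} (decimal {ord(ch)}) {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(describe(ch))
```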
tl->eliadtsai: One more question since it makes a difference in how to handle things in the breakiterator: In the example I see that the . is used within a word and at the end of the word as well. How about this for the single and double quote? Are they only part of the word if used within a word, or would quotes at start and end need to be part of the word as well?
@tl, @eliadtsai: Hebrew has special Unicode code points for quote characters: 0x05F3 (equivalent to single quote) and 0x05F4 (equivalent to double quote). IMHO they should be treated in the same way as their ASCII equivalents.
- Double quotes may be used to mark abbreviations. In that case they will be *enclosed* by Hebrew characters (no space character on either side, as shown in the screenshot).
- Single quotes may be used to modify certain Hebrew letters or to mark abbreviations. As opposed to the double quote case, the single quote may also be placed as the last character of the word / abbreviation (this case is not shown in eliadtsai's screenshot).
So it should be very easy to determine the usage of a double quote character. If it is enclosed by Hebrew letters on both sides, it should be considered part of the word. If there is a space or punctuation character on either side, it is probably used as a quotation mark.
The usage of single quotes is more complicated: when one appears in the middle of a word, it should be considered part of the word (same as in the previous case). When it appears at the end of a word, it could be part of the word / abbreviation, or it could be used as a quotation mark.
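The rule proposed in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the breakiterator code in OpenOffice.org: the names `split_words`, `is_hebrew_letter` and `keep_trailing_single_quote` are hypothetical, Hebrew letters are taken to be U+05D0 through U+05EA, and niqqud and other combining marks are not handled.

```python
DOUBLE_QUOTES = {'"', "\u05F4"}  # ASCII quotation mark, Hebrew gershayim
SINGLE_QUOTES = {"'", "\u05F3"}  # ASCII apostrophe, Hebrew geresh


def is_hebrew_letter(ch: str) -> bool:
    # Assumption: only the base letters alef..tav count; marks are ignored.
    return "\u05D0" <= ch <= "\u05EA"


def split_words(text: str, keep_trailing_single_quote: bool = True) -> list[str]:
    """Split text into words, keeping quotes enclosed by Hebrew letters."""
    words: list[str] = []
    current: list[str] = []

    def flush() -> None:
        # Close the word collected so far, if any.
        if current:
            words.append("".join(current))
            current.clear()

    for i, ch in enumerate(text):
        prev_heb = i > 0 and is_hebrew_letter(text[i - 1])
        next_heb = i + 1 < len(text) and is_hebrew_letter(text[i + 1])
        if ch.isalnum():
            current.append(ch)
        elif ch in (DOUBLE_QUOTES | SINGLE_QUOTES) and prev_heb and next_heb:
            # Quote enclosed by Hebrew letters on both sides: part of the word.
            current.append(ch)
        elif ch in SINGLE_QUOTES and prev_heb and keep_trailing_single_quote:
            # Trailing single quote: ambiguous, kept only if the caller asks.
            current.append(ch)
            flush()
        else:
            # Anything else (space, punctuation, bare quote) ends the word.
            flush()
    flush()
    return words
```

With this sketch, `split_words('\u05EA"\u05D6')` keeps the acronym as one word, while a double quote next to a space is dropped as a quotation mark. The trailing single quote stays ambiguous, as noted above: with the flag on, a closing quotation mark directly after a Hebrew word would also be attached to it.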
Hi, I think this is an extremely important issue to solve for Hebrew writers, because as it stands, every acronym, abbreviation, and foreign word with the "J" sound is marked as a mistake, with no way for the user to accept it. eliadtsai: please note that the period (.) is *not* one of the problematic symbols: only the single and double quotes (and the Unicode characters "geresh" and "gershayim") are used in these cases. In Hebrew it is customary to write acronyms with gershayim (double quotes), e.g., ת"ז, not with dots (ת.ז.). Creating acronyms with dots is a common practice in English, but NOT in Hebrew.
Set target to 3.x; not relevant for the 3.4 release.